Creating a unmount_operation with safety checks for nasbackup.sh #12133

Hanarion · 2025-11-25T12:57:04Z

Description

This PR resolves the random backup failures observed when using a CIFS (SMB) backup repository with NAS backup. The original issue describes how backups appear to complete — files transferred, file remaining = 0 — but the job ends in status FAILED because the subsequent sync + umount step blocks: the mount point remains busy and cannot unmount cleanly.

What was happening:

After the data copy, the script runs sync, but CIFS does not always flush and close all filesystem handles immediately, so the mount can remain busy.

The subsequent umount $mount_point then fails ("target is busy"). The mount and its directory are left behind, and the job fails even though the backup data is present.

The issue is intermittent (“sometimes it fails, sometimes it doesn’t”) due to timing/race conditions with CIFS.

What this PR implements:

Adds a polling loop (e.g., using fuser -m <mount_point>) with a timeout that waits for any active handles on the mount to clear before attempting umount.

If the mount is still busy when the timeout expires, the script prints an error message and attempts the umount anyway, since it may still succeed.

Also ensures that the umount is triggered for backups of stopped VMs.
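
The wait-then-unmount behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the names umount_with_wait, busy_check and do_umount are hypothetical, and the two small wrappers exist only so the fuser and umount calls can be stubbed out when experimenting without a real CIFS mount.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; function names are illustrative, not taken from nasbackup.sh.

# Succeeds (exit 0) while some process still uses the mounted filesystem.
busy_check() { fuser -m "$1" >/dev/null 2>&1; }

# Thin wrapper so umount's error text can be captured by the caller.
do_umount() { umount "$1" 2>&1; }

umount_with_wait() {
  local mount_point="$1"
  local timeout="${2:-10}"   # seconds to wait; 10 is the value mentioned in the review
  local elapsed=0
  local output

  # Poll once per second until the mount is idle or the timeout expires.
  while busy_check "$mount_point" && [ "$elapsed" -lt "$timeout" ]; do
    sleep 1
    elapsed=$((elapsed + 1))
  done

  if [ "$elapsed" -ge "$timeout" ]; then
    echo "Timeout for unmounting reached: still busy"
  fi

  # Try the unmount regardless; only remove the directory if it worked.
  if output=$(do_umount "$mount_point"); then
    rmdir "$mount_point"
  else
    echo "Warning: failed to unmount $mount_point, skipping rmdir"
    echo "umount error message: $output"
    return 1
  fi
}
```

Whether the caller treats the non-zero status as fatal is a separate decision; the sketch only reports it.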

Types of changes

Breaking change (fix or feature that would cause existing functionality to change)
New feature (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue)
Enhancement (improves an existing feature and functionality)
Cleanup (Code refactoring and cleanup, that may add test cases)
Build/CI
Test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

Major
Minor

Bug Severity

Screenshots (if appropriate):

How Has This Been Tested?

I ran multiple tests by calling the script directly and checking the return code while blocking the umount:

[root@compute01 ~]# /usr/bin/bash /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/nasbackup.sh -o backup -v i-12-606-VM -t cifs -s '/XXXXXX.XXXXX/XXX' -m 'vers=3.0,username=XXXXXX,password=XXXXXX' -p 'i-12-606-VM/test' -q false -d ''

Job type:         Completed   
Operation:        Backup      
Time elapsed:     32208        ms
File processed:   23.000 GiB
File remaining:   0.000 B
File total:       23.000 GiB

2770737887
Timeout for unmounting reached: still busy
Warning: failed to unmount /tmp/csbackup.weorL, skipping rmdir
umount error message: umount: /tmp/csbackup.weorL: target is busy.
[root@compute01 ~]# echo $?
0
[root@compute01 ~]# grep -i unmount /var/log/cloudstack/agent/agent.log
2025-11-25 13-42-17> Warning: failed to unmount /tmp/csbackup.weorL, error: umount: /tmp/csbackup.weorL: target is busy.
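
The description does not say how the umount was blocked during these tests. One common way to reproduce a "target is busy" condition, shown here purely as an assumption, is to keep a background process whose working directory is inside the mount point. The helper names hold_busy and release_busy and the HOLDER_PID variable are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical test helpers, not part of nasbackup.sh.

# Starts a background process whose working directory is inside the given
# directory. While it runs, unmounting a filesystem mounted there fails with
# "target is busy" and fuser -m lists the process.
hold_busy() {
  local dir="$1"
  local seconds="${2:-30}"
  ( cd "$dir" && exec sleep "$seconds" ) >/dev/null 2>&1 &
  HOLDER_PID=$!
}

# Stops the holder so the mount can become idle again.
release_busy() {
  kill "$HOLDER_PID" 2>/dev/null || true
  wait "$HOLDER_PID" 2>/dev/null || true
}
```

In practice the holder would have to be started while the script has the share mounted, for example from a second terminal once the /tmp/csbackup.* directory appears.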

How did you try to break this feature and the system with this change?

This change should not break anything: it simply fixes the wrong return code when umount fails and adds more detail to stdout and the logs.

boring-cyborg · 2025-11-25T12:57:07Z

Congratulations on your first Pull Request and welcome to the Apache CloudStack community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/cloudstack/blob/main/CONTRIBUTING.md)
Here are some useful points:

In case of a new feature add useful documentation (raise doc PR at https://github.com/apache/cloudstack-documentation)
Be patient and persistent. It might take some time to get a review or get the final approval from the committers.
Pay attention to the quality of your code, ensure tests are passing and your PR doesn't have conflicts.
Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Issues, Mailing list and Slack.
Be sure to read the CloudStack Coding Conventions.
Apache CloudStack is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@cloudstack.apache.org (https://cloudstack.apache.org/mailing-lists.html)
Slack: https://apachecloudstack.slack.com/

sureshanaparti · 2025-11-25T13:14:46Z

@blueorangutan package

blueorangutan · 2025-11-25T13:16:03Z

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

Copilot

Pull request overview

This PR addresses intermittent backup failures when using CIFS (SMB) repositories by introducing a new umount_operation() function that waits for mount points to become idle before attempting to unmount. The key improvement is adding a polling mechanism with timeout to handle race conditions where CIFS doesn't immediately flush filesystem handles.

Adds umount_operation() function with 10-second timeout and busy-wait logic using fuser
Replaces direct umount + rmdir calls with umount_operation() in backup functions
Adds error logging and warning messages for unmount failures

Comments suppressed due to low confidence (1)

scripts/vm/hypervisor/kvm/nasbackup.sh:206

The delete_backup and get_backup_stats functions still use the old unmount pattern (umount $mount_point followed by rmdir $mount_point) instead of the new umount_operation() function. For consistency and to apply the same safety checks across all operations, these functions should also use umount_operation().

delete_backup() {
  mount_operation

  rm -frv $dest
  sync
  umount $mount_point
  rmdir $mount_point
}

get_backup_stats() {
  mount_operation

  echo $mount_point
  df -P $mount_point 2>/dev/null | awk 'NR==2 {print $2, $3}'
  umount $mount_point
  rmdir $mount_point
}
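
Applied to these two functions, the suggestion would look roughly like the sketch below. The bodies of mount_operation and umount_operation here are simplified stand-ins (assumptions made so the snippet runs on its own with a temporary directory instead of a real mount), and the dest path is hypothetical; only the replacement of the umount and rmdir lines reflects the review comment.

```shell
#!/usr/bin/env bash
# Stand-ins for illustration only; the real functions live in nasbackup.sh.
mount_operation() {
  mount_point=$(mktemp -d)        # the real script mounts the NAS share here
  dest="$mount_point/backup"      # hypothetical backup path inside the mount
}
umount_operation() {
  rmdir "$mount_point"            # the real helper waits, unmounts, then removes the directory
}

delete_backup() {
  mount_operation

  rm -frv "$dest"
  sync
  umount_operation                # replaces: umount $mount_point; rmdir $mount_point
}

get_backup_stats() {
  mount_operation

  echo "$mount_point"
  df -P "$mount_point" 2>/dev/null | awk 'NR==2 {print $2, $3}'
  umount_operation                # replaces: umount $mount_point; rmdir $mount_point
}
```

Routing every unmount through one helper means the wait and the warning behave the same for backup, delete, and stats operations.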


Copilot · 2025-11-25T13:16:27Z

scripts/vm/hypervisor/kvm/nasbackup.sh

      exit 1
  fi
 }



The umount_operation() function lacks documentation explaining its purpose, behavior, and return value. Consider adding a comment block describing: (1) that it waits up to 10 seconds for the mount point to become idle, (2) that it attempts to unmount and remove the directory, and (3) its error handling behavior (currently does not fail the script on unmount failure).

Suggested change

# umount_operation
# ----------------
# Waits up to 10 seconds for the mount point to become idle (not in use).
# Attempts to unmount the specified mount point. If unmounting succeeds,
# removes the mount directory. If unmounting fails, logs a warning and
# does not remove the directory. This function does not cause the script
# to exit on unmount failure; the script continues execution.

Copilot · 2025-11-25T13:16:27Z

scripts/vm/hypervisor/kvm/nasbackup.sh

+  else
+    echo "Warning: failed to unmount $mount_point, skipping rmdir"
+    echo "umount error message: $umount_output"
+    log -ne "Warning: failed to unmount $mount_point, error: $umount_output"


The function does not return a non-zero exit code when unmount fails. Since the script uses set -eo pipefail at the top, and this function temporarily disables errors with set +e, a failed unmount will not cause the script to exit with an error status. This means backup jobs will report success even when the unmount fails, which contradicts the PR's goal of fixing "random backup failures." Consider adding return 1 or exit 1 in the else branch (lines 244-248) to ensure proper error propagation.

Suggested change

-    log -ne "Warning: failed to unmount $mount_point, error: $umount_output"
+    log -ne "Warning: failed to unmount $mount_point, error: $umount_output"
+    return 1
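
The interaction between a non-zero return and set -e that this comment relies on can be seen in isolation. The snippet below is only an illustration: cleanup is a hypothetical stand-in for the unmount helper, and each case runs in a child shell so its exit status can be inspected.

```shell
#!/usr/bin/env bash
# Illustration only; `cleanup` is a hypothetical stand-in that always fails.

# Case 1: under `set -e`, an unchecked failing function aborts the script,
# so the job would be reported as failed.
strict_status=0
sh -c 'set -e; cleanup() { return 1; }; cleanup; echo "not reached"' || strict_status=$?

# Case 2: handling the failure explicitly keeps it visible as a warning
# while letting the script finish with status 0.
lenient_status=0
sh -c 'set -e; cleanup() { return 1; }; cleanup || echo "warning: cleanup failed"; exit 0' || lenient_status=$?

echo "strict=$strict_status lenient=$lenient_status"
```

Which behaviour is appropriate depends on whether a leftover mount should fail an otherwise complete backup; the PR author's own test run shows exit status 0 in that situation.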

Copilot · 2025-11-25T13:16:28Z

scripts/vm/hypervisor/kvm/nasbackup.sh

+  elapsed=0
+  while fuser -m "$mount_point" >/dev/null 2>&1 && (( elapsed < 10 )); do
+      sleep 1
+      elapsed=$((elapsed + 1))
+  done
+
+  # Check if timeout was reached
+  if (( elapsed >= 10 )); then


The timeout value 10 is hardcoded twice in the function (line 225 and line 231). If the timeout needs to be changed in the future, it must be updated in both places, increasing the risk of inconsistency. Consider using a local variable at the start of the function (e.g., local timeout=10) and referencing it in both locations.

Suggested change

+  local timeout=10
   elapsed=0
-  while fuser -m "$mount_point" >/dev/null 2>&1 && (( elapsed < 10 )); do
+  while fuser -m "$mount_point" >/dev/null 2>&1 && (( elapsed < timeout )); do
       sleep 1
       elapsed=$((elapsed + 1))
   done

   # Check if timeout was reached
-  if (( elapsed >= 10 )); then
+  if (( elapsed >= timeout )); then

codecov · 2025-11-25T13:22:35Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 3.58%. Comparing base (8171d95) to head (895d09b).

❗ There is a different number of reports uploaded between BASE (8171d95) and HEAD (895d09b). Click for more details.

HEAD has 3 uploads less than BASE

Flag        BASE (8171d95)   HEAD (895d09b)
uitests     2                1
unittests   2                0

Additional details and impacted files

@@              Coverage Diff              @@
##               main   #12133       +/-   ##
=============================================
- Coverage     17.56%    3.58%   -13.98%     
=============================================
  Files          5912      445     -5467     
  Lines        529383    37536   -491847     
  Branches      64660     6901    -57759     
=============================================
- Hits          92984     1347    -91637     
+ Misses       425941    36025   -389916     
+ Partials      10458      164    -10294

Flag	Coverage Δ
uitests	`3.58% <ø> (ø)`
unittests	`?`


blueorangutan · 2025-11-25T14:36:10Z

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 15829

DaanHoogland · 2025-11-25T14:48:23Z

@Hanarion do you want this on v23 or on the next LTS iteration of 20 or 22? (main will not go in/on those)

Hanarion · 2025-11-25T15:23:19Z

@DaanHoogland I personally use the latest version, and it seems that 4.20 only supports NFS anyway, so sync should work fine in that version.

DaanHoogland · 2025-11-26T08:19:26Z

> @DaanHoogland I personally uses the latest version, and it seems that in 4.20 only NFS is supported anyway, so sync should work fine in this version

OK, how about 22.1, though? (rebase on the 4.22 branch)

Hanarion · 2025-11-26T08:23:42Z

> ok, how about 22.1, though? (rebase on the 4.22 branch)

Up to you! I'm not very familiar yet with how things usually work in the project, so I'm fine with it if you think rebasing on the 4.22 branch is the better option.

DaanHoogland · 2025-11-26T08:50:14Z

> > ok, how about 22.1, though? (rebase on the 4.22 branch)

> Up to you! I'm not very familiar yet with how things usually work in the project, so I'm fine with it if you think rebasing on the 4.22 branch is the better option.

For LTS releases we have release branches that get merged forward once in a while. For the recent 22.0 release we created 4.20 as the release branch. I think your fix is legitimate to add to that branch.

Creating a unmount_operation with safety checks for nasbackup.sh

895d09b

boring-cyborg bot added the component:kvm label Nov 25, 2025

sureshanaparti requested review from abh1sar and Copilot November 25, 2025 13:13

Copilot started reviewing on behalf of sureshanaparti November 25, 2025 13:14 View session

sureshanaparti added the component:backup label Nov 25, 2025

Copilot finished reviewing on behalf of sureshanaparti November 25, 2025 13:15

Copilot AI reviewed Nov 25, 2025

DaanHoogland assigned abh1sar Nov 26, 2025

5 participants