-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Creating a unmount_operation with safety checks for nasbackup.sh #12133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Congratulations on your first Pull Request and welcome to the Apache CloudStack community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/cloudstack/blob/main/CONTRIBUTING.md)
|
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR addresses intermittent backup failures when using CIFS (SMB) repositories by introducing a new umount_operation() function that waits for mount points to become idle before attempting to unmount. The key improvement is adding a polling mechanism with timeout to handle race conditions where CIFS doesn't immediately flush filesystem handles.
- Adds
umount_operation()function with 10-second timeout and busy-wait logic usingfuser - Replaces direct
umount+rmdircalls withumount_operation()in backup functions - Adds error logging and warning messages for unmount failures
Comments suppressed due to low confidence (1)
scripts/vm/hypervisor/kvm/nasbackup.sh:206
- The
delete_backupandget_backup_statsfunctions still use the old unmount pattern (umount $mount_pointfollowed byrmdir $mount_point) instead of the newumount_operation()function. For consistency and to apply the same safety checks across all operations, these functions should also useumount_operation().
delete_backup() {
mount_operation
rm -frv $dest
sync
umount $mount_point
rmdir $mount_point
}
get_backup_stats() {
mount_operation
echo $mount_point
df -P $mount_point 2>/dev/null | awk 'NR==2 {print $2, $3}'
umount $mount_point
rmdir $mount_point
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.
| exit 1 | ||
| fi | ||
| } | ||
|
|
Copilot
AI
Nov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The umount_operation() function lacks documentation explaining its purpose, behavior, and return value. Consider adding a comment block describing: (1) that it waits up to 10 seconds for the mount point to become idle, (2) that it attempts to unmount and remove the directory, and (3) its error handling behavior (currently does not fail the script on unmount failure).
| # umount_operation | |
| # ---------------- | |
| # Waits up to 10 seconds for the mount point to become idle (not in use). | |
| # Attempts to unmount the specified mount point. If unmounting succeeds, | |
| # removes the mount directory. If unmounting fails, logs a warning and | |
| # does not remove the directory. This function does not cause the script | |
| # to exit on unmount failure; the script continues execution. |
| else | ||
| echo "Warning: failed to unmount $mount_point, skipping rmdir" | ||
| echo "umount error message: $umount_output" | ||
| log -ne "Warning: failed to unmount $mount_point, error: $umount_output" |
Copilot
AI
Nov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function does not return a non-zero exit code when unmount fails. Since the script uses set -eo pipefail at the top, and this function temporarily disables errors with set +e, a failed unmount will not cause the script to exit with an error status. This means backup jobs will report success even when the unmount fails, which contradicts the PR's goal of fixing "random backup failures." Consider adding return 1 or exit 1 in the else branch (lines 244-248) to ensure proper error propagation.
| log -ne "Warning: failed to unmount $mount_point, error: $umount_output" | |
| log -ne "Warning: failed to unmount $mount_point, error: $umount_output" | |
| return 1 |
| elapsed=0 | ||
| while fuser -m "$mount_point" >/dev/null 2>&1 && (( elapsed < 10 )); do | ||
| sleep 1 | ||
| elapsed=$((elapsed + 1)) | ||
| done | ||
|
|
||
| # Check if timeout was reached | ||
| if (( elapsed >= 10 )); then |
Copilot
AI
Nov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timeout value 10 is hardcoded twice in the function (line 225 and line 231). If the timeout needs to be changed in the future, it must be updated in both places, increasing the risk of inconsistency. Consider using a local variable at the start of the function (e.g., local timeout=10) and referencing it in both locations.
| elapsed=0 | |
| while fuser -m "$mount_point" >/dev/null 2>&1 && (( elapsed < 10 )); do | |
| sleep 1 | |
| elapsed=$((elapsed + 1)) | |
| done | |
| # Check if timeout was reached | |
| if (( elapsed >= 10 )); then | |
| local timeout=10 | |
| elapsed=0 | |
| while fuser -m "$mount_point" >/dev/null 2>&1 && (( elapsed < timeout )); do | |
| sleep 1 | |
| elapsed=$((elapsed + 1)) | |
| done | |
| # Check if timeout was reached | |
| if (( elapsed >= timeout )); then |
Codecov Report✅ All modified and coverable lines are covered by tests.
Additional details and impacted files@@ Coverage Diff @@
## main #12133 +/- ##
=============================================
- Coverage 17.56% 3.58% -13.98%
=============================================
Files 5912 445 -5467
Lines 529383 37536 -491847
Branches 64660 6901 -57759
=============================================
- Hits 92984 1347 -91637
+ Misses 425941 36025 -389916
+ Partials 10458 164 -10294
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 15829 |
|
@Hanarion do you want this on v23 or on the next LTS iteration of 20 or 22? (main will not go in/on those) |
|
@DaanHoogland I personally uses the latest version, and it seems that in 4.20 only NFS is supported anyway, so |
ok, how about 22.1, though? (rebase on the 4.22 branch) |
Up to you! I'm not very familiar yet with how things usually work in the project, so I'm fine with it if you think rebasing on the 4.22 branch is the better option. |
for LTS releases we have release branches that get merged forwards once in a while. For the recent 22.0 release we created 4.20 as release branch. I think your fix is legit to add to that branch. |
Description
Fixes: #12122
This PR resolves the random backup failures observed when using a CIFS (SMB) backup repository with NAS backup. The original issue describes how backups appear to complete — files transferred, file remaining = 0 — but the job ends in status FAILED because the subsequent sync + umount step blocks: the mount point remains busy and cannot unmount cleanly.
What was happening:
After the data copy, the script issues sync but because CIFS doesn’t always flush/close all filesystem handles immediately, the mount remains busy.
The script attempting umount $mount_point fails (“target is busy”), the mount and directory remain, leaving resources dangling and causing job to fail even though the backup data is present.
The issue is intermittent (“sometimes it fails, sometimes it doesn’t”) due to timing/race conditions with CIFS.
What this PR implements:
Adds a polling loop (e.g., using fuser ‑m <mount_point>) with a timeout to wait for any active handles on the mount to clear before attempting umount.
If the mount remains busy past the timeout, we show an error text, and still try to umount (We never know, it may work if we are lucky)
We also ensures that on backups of stopped VMs, the umount is also triggered
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
I ran multiple tests by directly calling the script and checking the return code while blocking the umount :
How did you try to break this feature and the system with this change?
This change should not break anything as it simply fix the wrong return code when umount fails, and add more details in stdout and logs