(Fix-217) Monitor external mount and handle appropriately if mount flickers or fails #224
Add a dummy file to the external mount as part of the mounting process and kick off a process to monitor that file for any changes. File changes should trigger karen to run a script that does some checks to establish whether the mount is ok or if umbrel should be stopped.
8711e7f
to
d0f313d
Compare
Concept ACK |
Thanks so much for working on this @vindard. A couple of notes:

I'm not sure watching for changes to a file is the best way to check for changes to the underlying filesystem. It's possible that unmounting the entire filesystem will not trigger events on every file on the filesystem, so it might not trigger fswatch. I haven't tested this, and it might still work, but I think it's better to directly monitor the filesystem mount status since that's exactly what we want to catch.

Also, I don't think we should attempt to restart Umbrel. If the storage device is acting unreliably, I don't think we should attempt to continue running. We should completely stop Umbrel until the reliability issues are resolved. Otherwise the user could end up running for weeks without realising anything's wrong. If their drive is unreliable they could lose data at any moment. Their node would be constantly going down and then coming back up. It's better to fail loud and hard and let them resolve the storage issues properly.

Also, if the storage device is not acting reliably, it could be quite dangerous to run the mount script again. If we don't detect an Umbrel installation, we format the device. So if the device contains an installation but we can't detect it due to power issues, we may end up wiping an active installation.

Great initiative with extracting the monitor function out into a sourceable file though. We've ended up with quite a bit of code duplication throughout the shell scripts, and we need to go through and clean this up a bit, much like how you have done for the monitor function in this PR.

I think, in the interest of time to get a working fix out quickly without touching any other functionality, a simple reliable solution would be a single new bash script with the following logic:
|
@lukechilds thanks for the review!
Fair enough, I haven't been able to test this properly myself so can't confirm either. This was actually one of the open questions in my head (whether monitoring a file would be reliable enough).
Ya true, I think this makes sense and is probably a much easier approach implementation- & testing-wise too.
Thanks 😅 ... I took an open swing with that one to see how it would look in the code and if it would make things easier. Great to hear that there's also some cleanup work planned along these lines!
Planned changes
Ok so a
Of course this whole "persisting" thing may also be outside the issue scope, and could be a later thing to tackle (it may be in line with Luke's error docker page suggestion) |
Thanks for the PR, @vindard! Genuinely impressed by the speed you caught up with the codebase. 🙌 Overall I agree that explicitly monitoring mount status would be a better choice. Re your questions:
I think
I agree with this:
Because we wouldn't know on the next restart if the same problem is going to occur again or not, and letting the user signal that they have fixed the problem could be tricky, especially since we cannot nail down the exact problem and provide them any actionable feedback. Plus "turning it off and on again" is regarded as the ultimate fix for all problems. 😅 |
I think it should look roughly something like this: (note this is completely untested 😅)

#!/usr/bin/env bash
set -euo pipefail

UMBREL_ROOT="$(readlink -f "$(dirname "${BASH_SOURCE[0]}")/../../..")"
MOUNT_POINT="/mnt/data"

# Exit if another instance of this script is already running.
check_if_not_already_running() {
  if ps ax | grep "$0" | grep -v "$$" | grep bash | grep -v grep
  then
    echo "storage monitor is already running"
    exit 1
  fi
}

main () {
  check_if_not_already_running
  while true; do
    echo "Checking Umbrel root is bind mounted to external storage..."
    if ! df -h "${UMBREL_ROOT}" | grep --quiet '/dev/sd'; then
      break
    fi

    echo "Checking external storage is mounted..."
    # "|| true" stops "set -e" from exiting the script when grep finds nothing.
    mount_data=$(mount | grep " ${MOUNT_POINT} " || true)
    if [[ -z "${mount_data}" ]]; then
      break
    fi

    echo "Checking external storage is not read only..."
    # Fail when "ro" appears as a whole mount option (not e.g. "errors=remount-ro").
    if echo "${mount_data}" | awk -F '[()]' '{print $2}' | grep --quiet --extended-regexp '(^|,)ro(,|$)'; then
      break
    fi

    sleep 1
  done

  echo "Check failed, stopping Umbrel..."
  systemctl stop umbrel-startup
}

main

There's probably a better command to use than Also |
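The read-only check above parses the human-readable output of `mount`. As a rough sketch of one alternative (an assumption on my part, not what was merged), the options field can be read straight from `/proc/mounts`. The helper names `mount_options` and `is_mounted_read_only`, and the optional second argument that points at a different mounts file, are hypothetical and only here for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the optional mounts-file argument are
# hypothetical, not part of Umbrel.

# Print the comma-separated mount options for a mount point, or nothing if it
# is not mounted. Fields in /proc/mounts: device, mount point, type, options.
# If the mount point is listed more than once, the last entry wins.
mount_options() {
  local mount_point="$1"
  local mounts_file="${2:-/proc/mounts}"
  awk -v mp="${mount_point}" \
    '$2 == mp { opts = $4 } END { if (opts != "") print opts }' \
    "${mounts_file}"
}

# Succeed only if the mount point is mounted and "ro" is a whole option,
# so "errors=remount-ro" on a read-write mount does not count.
is_mounted_read_only() {
  local opts
  opts="$(mount_options "$@")"
  [[ -n "${opts}" ]] && [[ ",${opts}," == *",ro,"* ]]
}
```

A monitor loop could then call `is_mounted_read_only /mnt/data` and stop Umbrel when it succeeds. Mount points containing spaces are escaped in `/proc/mounts`, which this sketch does not handle.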
This commit switches from writing a file to the mount and monitoring that file to instead checking the mount status directly
aa29dfb
to
6758ef1
Compare
@lukechilds lol I literally just pushed some new commits that I think are pretty much along these lines 😂 My changes are also untested on an umbrel-os instance. I won't have time to get to testing until tomorrow though (it's GMT-4, 12:15am here), but if anyone gets to it before me feel free to go ahead 🙈 |
Awesome! I'm also away atm and don't have access to a physical device to test. Should be back home later tonight though so I'll have a play. |
2880282
to
c92da89
Compare
# 'mount-fail' gets triggered if 'check_mount' fails
trigger="mount-fail"
touch "${UMBREL_ROOT}/events/signals/${trigger}"
It may be a good idea to directly do systemctl stop umbrel-startup here since we know at this point the mount has failed, which means we're either writing this signal file to the SD card (which probably won't trigger karen as it was originally watching the signals directory on mounted fs) or we might not be able to touch this signal file at all if the drive is remounted as read-only.
Oh good point! I hadn't thought of all things that would no longer be available if that mount goes down. Making that change back in a bit.
Edit: changed here
The entire umbrel fs (including triggers folder) may not be available during a mount failure, meaning that the trigger mechanism would no longer work
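To illustrate why a signal file written after a mount failure may land in the wrong place, here is a small sketch that reports which device currently backs a path. `backing_device` is a hypothetical helper, not something from the Umbrel scripts.

```shell
#!/usr/bin/env bash
# Sketch only: `backing_device` is a hypothetical helper, not part of Umbrel.

# Print the device (first column of POSIX-format df output) that backs a path.
backing_device() {
  df -P "$1" | awk 'NR == 2 { print $1 }'
}
```

If `backing_device "${UMBREL_ROOT}"` no longer reports the external drive, anything touched under that path goes to the underlying filesystem instead, which fits the reasoning above for stopping the service directly rather than relying on a signal file.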
I left in the |
This needs to be the last service stopped in the external mount monitor script since the script itself will be killed when that line is run
I think it might be best to not worry about refactoring the file monitor functionality out into a separate file for now.
Let's keep this a super simple change so we can quickly test and review the device monitor functionality without having to worry about testing backup monitor functionality too.
We can do the bash refactor in a separate future PR.
@lukechilds @mayankchhabra hey folks, little update on my testing so far. It looks like I ended up in a little catch-22 when trying to stop the umbrel service. If we call If we remove the bind mount with Outside of the scripts, if I do
Potential solution
The only way I can think of around this is to have some script hosted at a non-
How does this sound to you all, and does anyone have any other ideas on how we might get around this? Edit 1: 2nd longer-term fix for this could be to set things up to only copy data-specific things to the external drive but keep all co-ordination code on the SD card. Bigger thing that can be tackled in later PRs though. Also randomly related to this, there's this weird edge case where systemd starts with sd-card hosted files and then switches to external drive version of things once the bind mount comes in, and then when you're troubleshooting you can no longer see/edit the files that got bind-mounted over (which def'ly made troubleshooting things trickier 🙈). |
Ya I think that makes sense. Certainly helps make reviewing things much quicker too! |
Just something to bear in mind, after some testing and managing to recreate the systemd state of re-mounting the drive in read only mode:
notice /dev/sda no longer exists, the block device is now at /dev/sdc but Checking the sdc block device and partition does not show them as read only:
So this is not a reliable check. I think this might be to show if the physical device/partition is set as read only, as opposed to if it's mounted as read only. However the |
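Since the block-device flag did not reflect the read-only remount, one complementary check (a sketch under my own assumptions, not something proposed in the thread) is to attempt a real write on the mount. `can_write_to` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: `can_write_to` is a hypothetical helper, not part of Umbrel.

# Succeed only if a file can actually be created and removed inside the given
# directory. This fails on a read-only mount and on a missing directory.
can_write_to() {
  local dir="$1"
  local probe
  probe="$(mktemp "${dir}/.write-probe.XXXXXX" 2>/dev/null)" || return 1
  rm -f "${probe}"
}
```

A write probe only shows that the path is writable. If the bind mount has gone away, the write may succeed against the SD card, so this would sit alongside a mount-status check rather than replace it.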
I'm not sure I'm following here, there are no long running processes that will be killed by
I think maybe a better solution might be to completely forget about what is/isn't mounted and just kill stuff with docker directly. We know that will always work, regardless of whether Umbrel/compose config/scripts etc are available. So monitor mount status, if something goes wrong, reliably kill everything with
You can view the underlying root filesystem without the $UMBREL_ROOT bind mount at |
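The exact command suggested above for stopping everything with docker is cut off in this thread, so the following is only one possible shape for it, not the merged implementation. `stop_all_containers` is a hypothetical helper; it relies only on `docker ps --quiet` and `docker stop`.

```shell
#!/usr/bin/env bash
# Sketch only: `stop_all_containers` is a hypothetical helper. The exact
# command proposed in the thread is not shown, so this is one possible form.

# Stop every running container by ID, without relying on compose files or
# scripts that may live on the failed mount.
stop_all_containers() {
  local ids
  ids="$(docker ps --quiet)" || return 1
  if [[ -z "${ids}" ]]; then
    return 0
  fi
  # Intentionally unquoted so each container ID becomes its own argument.
  # shellcheck disable=SC2086
  docker stop ${ids}
}
```

Because it talks to the Docker daemon by container ID, this does not need to read anything from the Umbrel directory at the moment of failure.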
External storage monitor tweaks.
Tested and everything worked flawlessly! Great job, @vindard and @lukechilds! 🎉
Ok yea that would do it lol! I was trying to do the direct docker stop as well, but (like a noob) was doing it as |
You know I actually saw this bind mount happening as well when I was first reviewing the code and was wondering why you guys were doing that. Makes sense now! |
Merged, thanks for helping out with this @vindard! |
This update brings some minor bugfixes and adds monitoring functionality to ensure your external storage device is running reliably.
Changes:
- Stop Umbrel if external storage device is unreliable (#224) af4066d (Thanks @vindard for helping out with this)
- Prevent start script from failing on no hidden service file (#226) 95a0dd8
- Bump dashboard to v0.3.7 (#227) eea7208
Diff: v0.2.9...v0.2.10
I'm opening a draft PR to share my progress and thinking on this so far. The changes here could potentially work as an immediate solution to this problem, but this still needs to be tested (I haven't had the time to test as yet, and I don't want to block this issue).
Problem
The issue (#217) we're tackling here is being able to properly handle a faulty external drive connection/mount.
Users have reported this happening for a range of reasons, including underpowered power supplies and incompatible SSD adapters. The general behaviour seems to be that the externally mounted files become either entirely unavailable, or partially unavailable when the drive is put into a read-only state.
Solution approach
In all the failure modes observed/reported, it seems that a common solution could be to monitor the mounted drive somehow and trigger some set of remediation actions if the drive becomes unavailable.
My approach so far has been to:
- abstract the scripts/backup/monitor script into a common scripts/monitor-lib script that can be sourced
- add a 2nd scripts/umbrel-os/external-storage/monitor script that:
  - creates a dummy file on the external mount as the last step in the mount process
  - monitors that dummy file for any changes (file permissions, availability etc.) through the monitor_file function
  - triggers a monitor-check script if any file changes are detected in the dummy file
- call the scripts/umbrel-os/external-storage/monitor script from the mount script which is run as a system service in umbrel-os at startup
- for now, have the monitor-check trigger script simply re-run the mount script (or restart the external-storage service?)

It seems that the mount script already has all the required handling to also attempt a re-mount of the drive and then take the appropriate actions if unsuccessful. Alternatively, this monitor-check script is the place where I'd place any other handling steps for external mount failures.

Still to be done
- Check if it'll be better to restart the external-storage instead of calling mount again from the mount-check trigger
- Test that the monitor-lib abstraction didn't break the existing scripts/backup/monitor script
- Test that the new scripts/umbrel-os/external-storage/monitor function works properly to monitor the changes we want to track on the mount-checkfile dummy file; also check that this dummy file monitoring is a good proxy for the general desired state of the external mount
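As a rough illustration of the dummy-file idea described above, the sketch below polls a file's permissions and modification time and returns once they change or the file disappears. It does not reproduce the PR's monitor_file function; `file_state` and `wait_for_change` are hypothetical names, and GNU `stat` (as on Linux) is assumed.

```shell
#!/usr/bin/env bash
# Sketch only: `file_state` and `wait_for_change` are hypothetical helpers and
# do not reproduce the PR's monitor_file function.

# Print a fingerprint of the file (octal permissions and modification time),
# or the word "missing" if it cannot be read. Uses GNU stat.
file_state() {
  stat -c '%a %Y' "$1" 2>/dev/null || echo "missing"
}

# Poll the file until its fingerprint changes, then return.
wait_for_change() {
  local file="$1"
  local interval="${2:-1}"
  local initial
  initial="$(file_state "${file}")"
  while [[ "$(file_state "${file}")" == "${initial}" ]]; do
    sleep "${interval}"
  done
}
```

A caller would run `wait_for_change` on the dummy file and then invoke the check script. Polling does not depend on filesystem events being delivered, at the cost of a detection delay of up to one interval.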