(Fix-217) Monitor external mount and handle appropriately if mount flickers or fails #224

vindard · 2020-09-06T18:25:18Z

I'm opening a draft PR to share my progress and thinking on this so far. The changes here could potentially work as an immediate solution to this problem, but this still needs to be tested (I haven't had the time to test as yet, and I don't want to block this issue).

Problem

The issue (#217) we're tackling here is being able to properly handle a faulty external drive connection/mount.

Users have reported this happening from a range of reasons including underpowering from power supplies and incompatible SSD adapters. The general behaviour seems to be that the externally mounted files become either entirely unavailable, or partially unavailable when the drive is put into a read-only state.

Solution approach

In all the failure modes observed/reported, it seems that a common solution could be to monitor the mounted drive somehow and trigger some set of remediation actions if the drive becomes unavailable.

My approach so far has been to:

abstract the scripts/backup/monitor script into a common scripts/monitor-lib script that can be sourced
add a 2nd scripts/umbrel-os/external-storage/monitor script that:
1. creates a dummy file on the external mount as the last step in the mount process
2. monitors that dummy file for any changes (file permissions, availabilty etc.) through the monitor_file function
3. triggers a monitor-check script if any file changes are detected in the dummy file
call the scripts/umbrel-os/external-storage/monitor script from the mount script which is run as a system service in umbrel-os at startup
for now, have the monitor-check trigger script simply re-run the mount script (or restart the external-storage service?)
It seems that the mount script already has all the required handling to also attempt a re-mount of the drive and then take the appropriate actions if unsuccessful. Alternatively, this monitor-check script is the place where I'd place any other handling steps for external mount failures.

Still to be done

Check if it'll be better to restart the external-storage instead of calling mount again from the mount-check trigger
Test that the monitor-lib abstraction didn't break the existing scripts/backup/monitor script
Test that the new scripts/umbrel-os/external-storage/monitor function works properly to monitor the changes we want to track on the mount-checkfile dummy file on the external drive; also check that this dummy file monitoring is a good proxy for the general desired state of the external mount

Add a dummy file to the external mount as part of the mounting process and kick off a process to monitor that file for any changes. File changes should trigger karen to run a script that does some checks to establish whether the mount is ok or if umbrel should be stopped.

nolim1t · 2020-09-07T06:09:03Z

Concept ACK

lukechilds · 2020-09-07T06:45:42Z

Thanks so much for working on this @vindard.

A couple of notes:

I'm not sure watching for changes to a file is the best way to check for changes to the underlying filesystem. It's possible that unmounting the entire filesystem will not trigger events on every file on the filesystem, so could not trigger fswatch. I haven't tested this, it might still work, but I think it's better to directly monitor the filesystem mount status since that's exactly what we want to catch.

Also I don't think we should attempt to restart Umbrel. If the storage device is acting unreliably I don't think we should attempt to continue running. We should completely stop Umbrel until the reliability issues are resolved. Otherwise the user could end up running for weeks without realising anything's wrong. If their drive is unreliable they could lose data at any moment. Their node would be constantly going down and then coming back up. It's better to fail loud and hard and let them resolve the storage issues properly.

Also if the storage device is not acting reliably, it could be quite dangerous to run the mount script again. If we don't detect an Umbrel installation we format the device. So if the device contains an installation but we can't detect it due to power issues, we may end up wiping an active installation.

Great initiative with extracting the monitor function out into a sourceable file though. We've ended up with quite a bit of code duplication throughout the shell scripts, we need to go though and clean this up a bit much like how you have done for the monitor function in this PR.

I think in the interest of time to get a working fix out quickly without touching any other functionality a simple reliable solution would be a single new bash script with the following logic:

While loop
Check /mnt/data is mounted and not read only
If not, stop Umbrel

vindard · 2020-09-07T12:12:59Z

@lukechilds thanks for the review!

I'm not sure watching for changes to a file is the best way to check for changes to the underlying filesystem. It's possible that unmounting the entire filesystem will not trigger events on every file on the filesystem, so could not trigger fswatch. I haven't tested this, it might still work, but I think it's better to directly monitor the filesystem mount status since that's exactly what we want to catch.

Fair enough, I haven't been able to test this properly myself so can't confirm either. This was actually one of the open questions in my head (whether monitoring a file would be reliable enough).

Also if the storage device is not acting reliably, it could be quite dangerous to run the mount script again. If we don't detect an Umbrel installation we format the device. So if the device contains an installation but we can't detect it due to power issues, we may end up wiping an active installation.

I think in the interest of time to get a working fix out quickly without touching any other functionality a simple reliable solution would be a single new bash script with the following logic:

While loop

Check /mnt/data is mounted and not read only

If not, stop Umbrel

Ya true, I think this makes sense and is probably a much easier approach implementation- & testing-wise too.

Great initiative with extracting the monitor function out into a sourceable file though. We've ended up with quite a bit of code duplication throughout the shell scripts, we need to go though and clean this up a bit much like how you have done for the monitor function in this PR.

Thanks 😅 ... I took an open swing with that one to see how it would look in the code and if it would make things easier. Great to hear that there's also some cleanup work planned along these lines!

Planned changes

Ok so a while loop checking the fs for mount state changes should be easy enough to do. My open questions would be:

How we stop
I'm assuming we prefer $ systemctl stop umbrel-startup over calling the stop script directly?
Persisting some detected faulty mount flag
Do we persist the fact that we detected a faulty mount somehow so that if the user restarts the OS we can use that persisted data to know that we shouldn't start umbrel? The alternative (simpler) would be to only have this while loop apply to each running OS instance. This would mean that if the device is restarted then we start everything over as usual, and start monitoring for faults again with no knowledge of previous fault detections.
User input if we decide to persist
If we do persist that faulty mount flag, what are you all thinking on how we can let the user signal back to us that they've fixed the problem and would like to try running umbrel again?

Of course this whole "persisting" thing may also be outside the issue scope, and could be a later thing to tackle (it may be in line with Luke's error docker page suggestion)

mayankchhabra · 2020-09-07T17:56:38Z

Thanks for the PR, @vindard! Genuinely impressed by the speed you caught up with the codebase. 🙌

Overall I agree that explicitly monitoring mount status would be a better choice.

Re your questions:

How we stop

I think systemctl stop umbrel-startup would be better suited in this case since:

The mounting/unmounting logic is present on only Umbrel OS, which utilizes systemd services, unlike manual Umbrel installations that utilize start/stop scripts.
systemd would also automatically kill all the forked child processes of start script (eg. karen, backup monitor, etc), which the stop script currently doesn't kill.

Persisting some detected faulty mount flag and user input if we decide to persist

I agree with this:

The alternative (simpler) would be to only have this while loop apply to each running OS instance. This would mean that if the device is restarted then we start everything over as usual, and start monitoring for faults again with no knowledge of previous fault detections.

Because we wouldn't know on the next restart if the same problem is going to occur again or not, and letting the user signal that they have fixed the problem could be tricky, especially since we cannot nail down the exact problem and provide them any actionable feedback.

Plus "turning it off and on again" is regarded as the ultimate fix for all problems. 😅

lukechilds · 2020-09-08T04:05:12Z

I think it should look roughly something like this: (note this is completely untested 😅)

#!/usr/bin/env bash

set -euo pipefail

UMBREL_ROOT="$(readlink -f $(dirname "${BASH_SOURCE[0]}")/../../..)"
MOUNT_POINT="/mnt/data"

check_if_not_already_running() {
  if ps ax | grep $0 | grep -v $$ | grep bash | grep -v grep
  then
    echo "storage monitor is already running"
    exit 1
  fi
}

main () {
  check_if_not_already_running

  while true; do
    echo "Checking Umbrel root is bind mounted to external storage..."
    if ! df -h "${UMBREL_ROOT}" | grep --quiet '/dev/sd'; then
      break
    fi

    echo "Checking external storage is mounted..."
    mount_data=$(mount | grep " ${MOUNT_POINT} ")
    if [[ "$(echo ${mount_data} | wc -l)" == "0" ]]; then
      break
    fi

    echo "Checking external storage is not read only..."
    if ! echo "${mount_data}" | awk -F '[()]' '{print $2}' | grep --quiet 'ro'; then
      break
    fi
    
    sleep 1
  done

  echo "Check failed, stopping Umbrel..."
  systemctl stop umbrel-startup
}

main

There's probably a better command to use than mount, ideally that you can pass a mount point to instead of greping through it's output.

Also | grep 'ro' is probably not safe since it would also match stuff in the brackets like errors=remount-ro so you'd need to check for an exact ro match (e.g either side of ro is either a comma or start/end of line).

This commit switches from writing a file to the mount and monitoring that file to instead checkin the mount status directly

vindard · 2020-09-08T04:12:28Z

I think it should look roughly something like this: (note this is completely untested sweat_smile)

@lukechilds lol I literally just pushed some new commits that I think are pretty much along these lines 😂

My changes are also untested on an umbrel-os instance. I won't have time to get to testing until tomorrow though (it's GMT-4, 12:15am here), but if anyone gets to it before me feel free to go ahead 🙈

lukechilds · 2020-09-08T04:19:45Z

Awesome!

I'm also away atm and don't have access to a physical device to test. Should be back home later tonight though so I'll have a play.

events/triggers/mount-fail

scripts/umbrel-os/external-storage/monitor

mayankchhabra · 2020-09-08T09:54:12Z

scripts/umbrel-os/external-storage/monitor

+
+    # 'mount-fail' gets triggered if 'check_mount' fails
+    trigger="mount-fail"
+    touch "${UMBREL_ROOT}/events/signals/${trigger}"


It may be a good idea to directly do systemctl stop umbrel-startup here since we know at this point the mount has failed, which means we're either writing this signal file to the SD card (which probably won't trigger karen as it was originally watching the signals directory on mounted fs) or we might not be able to touch this signal file at all if the drive is remounted as read-only.

Oh good point! I hadn't thought of all things that would no longer be available if that mount goes down. Making that change back in a bit.

Edit: changed here

The entire umbrel fs (including triggers folder) may not be available during a mount failure, meaning that the trigger mechanism would no longer work

vindard · 2020-09-08T13:45:08Z

I left in the monitor-lib separation since I had already implemented that for the monitor_file route. It does make the diff messy though and I'm not sure if that's a structure you guys would want, so if we want we can also back that out and just have the additional things in the new external-storage/monitor script

This needs to be the last service stopped in the external mount monitor script since the script itself will be killed when that line is run

lukechilds

I think it might be best to not worry about refactoring the file monitor functionality out into a separate file for now.

Let keep this a super simple change so we can quickly test and review the device monitor functionality without having to worry about testing backup monitor functionality too.

We can do the bash refactor in a separate future PR.

scripts/umbrel-os/external-storage/monitor

vindard · 2020-09-09T06:51:55Z

@lukechilds @mayankchhabra hey folks, little update on my testing so far. It looks like I ended up in a little catch-22 when trying to stop the umbrel service. If we call systemctl stop umbrel-startup, this doesn't run because the stop script is no longer available at the mount when the mount errors.

If we remove the bind mount with umount ${UMBREL_ROOT} then we can safely call systemctl stop umbrel-startup because the sd-card version of the umbrel dir becomes available again, but we can't remove that bind mount until all the long running processes spawned from the mount are killed. The external-storage monitor script that we're trying to do all this from is in itself a long running process initially spawned from that bind mount, which is where our catch-22 comes in 🤦‍♂️

Outside of the scripts, if I do systemctl stop umbrel-external-storage I can then umount ${UMBREL_ROOT} the bind mount and safely stop the umbrel service with systemctl stop umbrel-startup, but I currently can't do these steps from inside the monitor script.

Potential solution

The only way I can think of around this is to have some script hosted at a non-umbrel location that doesn't get affected by mounting/unmounting that we can call out to to safely bring everything down. Ideally the monitor would trigger that non-umbrel-hosted script that would then do 3 things:

stop umbrel-external-storage to kill long running processes (optional if the new monitor exits properly on its own)
unmount the bind mount to get back local access to files
stop umbrel-startup (or if this fails call the stop script directly from the SD-card version of things)

How does this sound to you all, and does anyone have any other ideas on how we might get around this?

Edit 1: 2nd longer-term fix for this could be to set things up to only copy data-specific things to the external drive but keep all co-ordination code on the SD card. Bigger thing that can be tackled in later PRs though.

Also randomly related to this, there's this weird edge case where systemd starts with sd-card hosted files and then switches to external drive version of things once the bind mount comes in, and then when you're troubleshooting you can no longer see/edit the files that got bind-mounted over (which def'ly made troubleshooting things trickier 🙈).

vindard · 2020-09-09T06:58:49Z

Let keep this a super simple change so we can quickly test and review the device monitor functionality without having to worry about testing backup monitor functionality too

Ya I think that makes sense. Certainly helps make reviewing things much quicker too!

lukechilds · 2020-09-09T07:17:29Z

Just something to bear in mind, after some testing and managing to recreate the systemd state of re-mounting the drive in read only mode:

$ ls /dev/sd*
/dev/sdb  /dev/sdb1  /dev/sdc  /dev/sdc1
$ mount | grep MyPassport
/dev/sda1 on /mnt/MyPassport type ext4 (ro,relatime,stripe=256)
$ cat /proc/mounts | grep MyPassport
/dev/sda1 /mnt/MyPassport ext4 ro,relatime,stripe=256 0 0

notice /dev/sda no longer exists, the block device is now at /dev/sdc but mount and /proc/mounts still show /dev/sda. /sys/block/sda also gets moved to /sys/block/sdc.

Checking the sdc block device and partition does not show them as read only:

$ cat /sys/block/sdc/ro
0
$ cat /sys/block/sdc/sdc1/ro
0

So this is not a reliable check. I think this might be to show if the physical device/partition is set as read only, as opposed to if it's mounted as read only.

However the mount output check proposed here #224 (comment) is still reliable.

lukechilds · 2020-09-09T07:35:54Z

Outside of the scripts, if I do systemctl stop umbrel-external-storage I can then umount ${UMBREL_ROOT} the bind mount and safely stop the umbrel service with systemctl stop umbrel-startup, but I currently can't do these steps from inside the monitor script.

stop umbrel-external-storage to kill long running processes (optional if the new monitor exits properly on its own)

I'm not sure I'm following here, there are no long running processes that will be killed by systemctl stop umbrel-external-storage. It's just a one off script that runs on boot: https://github.com/getumbrel/umbrel/blob/master/scripts/umbrel-os/external-storage/mount

How does this sound to you all, and does anyone have any other ideas on how we might get around this?

I think maybe a better solution might be to completely forget about what is/isn't mounted and just kill stuff with docker directly. We know that will always work, regardless of wether Umbrel/compose config/scripts etc are available.

So monitor mount status, if something goes wrong, reliably kill everything with docker stop $(docker ps -aq).

then when you're troubleshooting you can no longer see/edit the files that got bind-mounted over (which def'ly made troubleshooting things trickier 🙈).

You can view the underlying root filesystem without the $UMBREL_ROOT bind mount at /sd-root

lukechilds · 2020-09-09T12:41:34Z

@vindard I have this working and tested based on your branch here: vindard#1

If you can merge that we can go ahead and merge this!

External storage monitor tweaks.

mayankchhabra

Tested and everything worked flawlessly! Great job, @vindard and @lukechilds! 🎉

vindard · 2020-09-09T13:31:20Z

I think maybe a better solution might be to completely forget about what is/isn't mounted and just kill stuff with docker directly. We know that will always work, regardless of wether Umbrel/compose config/scripts etc are available.

So monitor mount status, if something goes wrong, reliably kill everything with docker stop $(docker ps -aq).

Ok yea that would do it lol!

I was trying to do the direct docker stop as well, but (like a noob) was doing it as docker-compose down which obviously wouldn't have worked without the docker-compose.yml file mounted. It totally slipped me to go to the docker processes directly 🙈

vindard · 2020-09-09T13:33:54Z

You can view the underlying root filesystem without the $UMBREL_ROOT bind mount at /sd-root

You know I actually saw this bind mount happening as well when I was first reviewing the code and was wondering why you guys were doing that. Makes sense now!

lukechilds · 2020-09-09T14:04:08Z

Merged, thanks for helping out with this @vindard!

@vindard

This update brings some minor bugfixes and adds monitoring functionality to ensure your external storage device is running reliably. Changes: Stop Umbrel if external storage device is unreliable (#224) af4066d (Thanks @vindard for helping out with this) Prevent start script from failing on no hidden service file (#226) 95a0dd8 Bump dashboard to v0.3.7 (#227) eea7208 Diff: v0.2.9...v0.2.10

vindard added 3 commits September 6, 2020 13:00

split 'monitor' script into 'lib' & 'call' scripts

38d3cea

re-run 'mount' script if mount falters

d0f313d

vindard force-pushed the issue-217/monitor-mount branch from 8711e7f to d0f313d Compare September 6, 2020 19:36

nolim1t linked an issue Sep 7, 2020 that may be closed by this pull request

Stop Umbrel if external storage is unmounted on Umbrel OS #217

Closed

vindard added 2 commits September 7, 2020 23:06

stop Umbrel if mount-check is triggered

bd137b2

refactor monitor-lib script

35b0394

switch mount check from 'mounted file' check

6758ef1

This commit switches from writing a file to the mount and monitoring that file to instead checkin the mount status directly

vindard force-pushed the issue-217/monitor-mount branch from aa29dfb to 6758ef1 Compare September 8, 2020 04:11

vindard marked this pull request as ready for review September 8, 2020 04:24

vindard commented Sep 8, 2020

View reviewed changes

events/triggers/mount-fail Outdated Show resolved Hide resolved

vindard commented Sep 8, 2020

View reviewed changes

scripts/umbrel-os/external-storage/monitor Outdated Show resolved Hide resolved

add 'bind-mount' check to external drive monitor script

c92da89

vindard force-pushed the issue-217/monitor-mount branch from 2880282 to c92da89 Compare September 8, 2020 04:27

mayankchhabra reviewed Sep 8, 2020

View reviewed changes

remove fail trigger; stop umbrel in monitor script

4ccb761

The entire umbrel fs (including triggers folder) may not be available during a mount failure, meaning that the trigger mechanism would no longer work

vindard added 5 commits September 9, 2020 00:26

add check for if device is unsafely unplugged

0c1b1ae

fix wrongly set script path in 'mount'

0433c95

change 'monitor' script permissions to allow execution

6fb0e1a

fix wrongly set UMBREL_ROOT in 'monitor'

ad5fc58

move 'umbrel-external-storage' service stop to be last step

f3d357d

This needs to be the last service stopped in the external mount monitor script since the script itself will be killed when that line is run

lukechilds requested changes Sep 9, 2020

View reviewed changes

lukechilds added 7 commits September 9, 2020 17:56

Remove changes external to storage monitor

c60bc9d

Make device status checks more reliable

7322ee6

Stop all docker containers globally

2bc8846

Use docker kill instead of docker stop

3e0ab3d

Only log on error to reduce log output

395c599

Run most catastrophic test first for more useful log output

66246f0

Fix bad wordcount test

9a4ab5b

lukechilds mentioned this pull request Sep 9, 2020

External storage monitor tweaks. vindard/umbrel#1

Merged

Merge pull request #1 from lukechilds/monitor-storage

20dda35

External storage monitor tweaks.

mayankchhabra approved these changes Sep 9, 2020

View reviewed changes

lukechilds approved these changes Sep 9, 2020

View reviewed changes

lukechilds merged commit af4066d into getumbrel:master Sep 9, 2020

vindard deleted the issue-217/monitor-mount branch September 9, 2020 14:04

lukechilds mentioned this pull request Sep 9, 2020

Umbrel v0.2.10 #228

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(Fix-217) Monitor external mount and handle appropriately if mount flickers or fails #224

(Fix-217) Monitor external mount and handle appropriately if mount flickers or fails #224

vindard commented Sep 6, 2020 •

edited

nolim1t commented Sep 7, 2020

lukechilds commented Sep 7, 2020

vindard commented Sep 7, 2020 •

edited

mayankchhabra commented Sep 7, 2020

lukechilds commented Sep 8, 2020 •

edited

vindard commented Sep 8, 2020 •

edited

lukechilds commented Sep 8, 2020

mayankchhabra Sep 8, 2020 •

edited

vindard Sep 8, 2020 •

edited

vindard commented Sep 8, 2020 •

edited

lukechilds left a comment

vindard commented Sep 9, 2020 •

edited

vindard commented Sep 9, 2020

lukechilds commented Sep 9, 2020 •

edited

lukechilds commented Sep 9, 2020 •

edited

lukechilds commented Sep 9, 2020

mayankchhabra left a comment

vindard commented Sep 9, 2020

vindard commented Sep 9, 2020 •

edited

lukechilds commented Sep 9, 2020

(Fix-217) Monitor external mount and handle appropriately if mount flickers or fails #224

(Fix-217) Monitor external mount and handle appropriately if mount flickers or fails #224

Conversation

vindard commented Sep 6, 2020 • edited

Problem

Solution approach

Still to be done

nolim1t commented Sep 7, 2020

lukechilds commented Sep 7, 2020

vindard commented Sep 7, 2020 • edited

Planned changes

mayankchhabra commented Sep 7, 2020

lukechilds commented Sep 8, 2020 • edited

vindard commented Sep 8, 2020 • edited

lukechilds commented Sep 8, 2020

mayankchhabra Sep 8, 2020 • edited

Choose a reason for hiding this comment

vindard Sep 8, 2020 • edited

Choose a reason for hiding this comment

vindard commented Sep 8, 2020 • edited

lukechilds left a comment

Choose a reason for hiding this comment

vindard commented Sep 9, 2020 • edited

Potential solution

vindard commented Sep 9, 2020

lukechilds commented Sep 9, 2020 • edited

lukechilds commented Sep 9, 2020 • edited

lukechilds commented Sep 9, 2020

mayankchhabra left a comment

Choose a reason for hiding this comment

vindard commented Sep 9, 2020

vindard commented Sep 9, 2020 • edited

lukechilds commented Sep 9, 2020

vindard commented Sep 6, 2020 •

edited

vindard commented Sep 7, 2020 •

edited

lukechilds commented Sep 8, 2020 •

edited

vindard commented Sep 8, 2020 •

edited

mayankchhabra Sep 8, 2020 •

edited

vindard Sep 8, 2020 •

edited

vindard commented Sep 8, 2020 •

edited

vindard commented Sep 9, 2020 •

edited

lukechilds commented Sep 9, 2020 •

edited

lukechilds commented Sep 9, 2020 •

edited

vindard commented Sep 9, 2020 •

edited