Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Fix-217) Monitor external mount and handle appropriately if mount flickers or fails #224

Merged
merged 21 commits into from
Sep 9, 2020

Conversation

vindard
Copy link
Contributor

@vindard vindard commented Sep 6, 2020

I'm opening a draft PR to share my progress and thinking on this so far. The changes here could potentially work as an immediate solution to this problem, but this still needs to be tested (I haven't had the time to test as yet, and I don't want to block this issue).

Problem

The issue (#217) we're tackling here is being able to properly handle a faulty external drive connection/mount.

Users have reported this happening from a range of reasons including underpowering from power supplies and incompatible SSD adapters. The general behaviour seems to be that the externally mounted files become either entirely unavailable, or partially unavailable when the drive is put into a read-only state.

Solution approach

In all the failure modes observed/reported, it seems that a common solution could be to monitor the mounted drive somehow and trigger some set of remediation actions if the drive becomes unavailable.

My approach so far has been to:

  • abstract the scripts/backup/monitor script into a common scripts/monitor-lib script that can be sourced

  • add a 2nd scripts/umbrel-os/external-storage/monitor script that:

    1. creates a dummy file on the external mount as the last step in the mount process

    2. monitors that dummy file for any changes (file permissions, availabilty etc.) through the monitor_file function

    3. triggers a monitor-check script if any file changes are detected in the dummy file

  • call the scripts/umbrel-os/external-storage/monitor script from the mount script which is run as a system service in umbrel-os at startup

  • for now, have the monitor-check trigger script simply re-run the mount script (or restart the external-storage service?)
    It seems that the mount script already has all the required handling to also attempt a re-mount of the drive and then take the appropriate actions if unsuccessful. Alternatively, this monitor-check script is the place where I'd place any other handling steps for external mount failures.


Still to be done

  1. Check if it'll be better to restart the external-storage instead of calling mount again from the mount-check trigger

  2. Test that the monitor-lib abstraction didn't break the existing scripts/backup/monitor script

  3. Test that the new scripts/umbrel-os/external-storage/monitor function works properly to monitor the changes we want to track on the mount-checkfile dummy file on the external drive; also check that this dummy file monitoring is a good proxy for the general desired state of the external mount

Add a dummy file to the external mount as part of the mounting
process and kick off a process to monitor that file for any
changes. File changes should trigger karen to run a script
that does some checks to establish whether the mount is ok
or if umbrel should be stopped.
@nolim1t
Copy link
Contributor

nolim1t commented Sep 7, 2020

Concept ACK

@nolim1t nolim1t linked an issue Sep 7, 2020 that may be closed by this pull request
@lukechilds
Copy link
Member

Thanks so much for working on this @vindard.

A couple of notes:

I'm not sure watching for changes to a file is the best way to check for changes to the underlying filesystem. It's possible that unmounting the entire filesystem will not trigger events on every file on the filesystem, so could not trigger fswatch. I haven't tested this, it might still work, but I think it's better to directly monitor the filesystem mount status since that's exactly what we want to catch.

Also I don't think we should attempt to restart Umbrel. If the storage device is acting unreliably I don't think we should attempt to continue running. We should completely stop Umbrel until the reliability issues are resolved. Otherwise the user could end up running for weeks without realising anything's wrong. If their drive is unreliable they could lose data at any moment. Their node would be constantly going down and then coming back up. It's better to fail loud and hard and let them resolve the storage issues properly.

Also if the storage device is not acting reliably, it could be quite dangerous to run the mount script again. If we don't detect an Umbrel installation we format the device. So if the device contains an installation but we can't detect it due to power issues, we may end up wiping an active installation.

Great initiative with extracting the monitor function out into a sourceable file though. We've ended up with quite a bit of code duplication throughout the shell scripts, we need to go though and clean this up a bit much like how you have done for the monitor function in this PR.

I think in the interest of time to get a working fix out quickly without touching any other functionality a simple reliable solution would be a single new bash script with the following logic:

  • While loop
  • Check /mnt/data is mounted and not read only
  • If not, stop Umbrel

@vindard
Copy link
Contributor Author

vindard commented Sep 7, 2020

@lukechilds thanks for the review!

I'm not sure watching for changes to a file is the best way to check for changes to the underlying filesystem. It's possible that unmounting the entire filesystem will not trigger events on every file on the filesystem, so could not trigger fswatch. I haven't tested this, it might still work, but I think it's better to directly monitor the filesystem mount status since that's exactly what we want to catch.

Fair enough, I haven't been able to test this properly myself so can't confirm either. This was actually one of the open questions in my head (whether monitoring a file would be reliable enough).

Also if the storage device is not acting reliably, it could be quite dangerous to run the mount script again. If we don't detect an Umbrel installation we format the device. So if the device contains an installation but we can't detect it due to power issues, we may end up wiping an active installation.

I think in the interest of time to get a working fix out quickly without touching any other functionality a simple reliable solution would be a single new bash script with the following logic:

  • While loop
  • Check /mnt/data is mounted and not read only
  • If not, stop Umbrel

Ya true, I think this makes sense and is probably a much easier approach implementation- & testing-wise too.

Great initiative with extracting the monitor function out into a sourceable file though. We've ended up with quite a bit of code duplication throughout the shell scripts, we need to go though and clean this up a bit much like how you have done for the monitor function in this PR.

Thanks 😅 ... I took an open swing with that one to see how it would look in the code and if it would make things easier. Great to hear that there's also some cleanup work planned along these lines!


Planned changes

Ok so a while loop checking the fs for mount state changes should be easy enough to do. My open questions would be:

  1. How we stop
    I'm assuming we prefer $ systemctl stop umbrel-startup over calling the stop script directly?

  2. Persisting some detected faulty mount flag
    Do we persist the fact that we detected a faulty mount somehow so that if the user restarts the OS we can use that persisted data to know that we shouldn't start umbrel? The alternative (simpler) would be to only have this while loop apply to each running OS instance. This would mean that if the device is restarted then we start everything over as usual, and start monitoring for faults again with no knowledge of previous fault detections.

  3. User input if we decide to persist
    If we do persist that faulty mount flag, what are you all thinking on how we can let the user signal back to us that they've fixed the problem and would like to try running umbrel again?

Of course this whole "persisting" thing may also be outside the issue scope, and could be a later thing to tackle (it may be in line with Luke's error docker page suggestion)

@mayankchhabra
Copy link
Member

Thanks for the PR, @vindard! Genuinely impressed by the speed you caught up with the codebase. 🙌

Overall I agree that explicitly monitoring mount status would be a better choice.

Re your questions:

How we stop

I think systemctl stop umbrel-startup would be better suited in this case since:

  • The mounting/unmounting logic is present on only Umbrel OS, which utilizes systemd services, unlike manual Umbrel installations that utilize start/stop scripts.
  • systemd would also automatically kill all the forked child processes of start script (eg. karen, backup monitor, etc), which the stop script currently doesn't kill.

Persisting some detected faulty mount flag and user input if we decide to persist

I agree with this:

The alternative (simpler) would be to only have this while loop apply to each running OS instance. This would mean that if the device is restarted then we start everything over as usual, and start monitoring for faults again with no knowledge of previous fault detections.

Because we wouldn't know on the next restart if the same problem is going to occur again or not, and letting the user signal that they have fixed the problem could be tricky, especially since we cannot nail down the exact problem and provide them any actionable feedback.

Plus "turning it off and on again" is regarded as the ultimate fix for all problems. 😅

@lukechilds
Copy link
Member

lukechilds commented Sep 8, 2020

I think it should look roughly something like this: (note this is completely untested 😅)

#!/usr/bin/env bash

set -euo pipefail

UMBREL_ROOT="$(readlink -f $(dirname "${BASH_SOURCE[0]}")/../../..)"
MOUNT_POINT="/mnt/data"

check_if_not_already_running() {
  if ps ax | grep $0 | grep -v $$ | grep bash | grep -v grep
  then
    echo "storage monitor is already running"
    exit 1
  fi
}

main () {
  check_if_not_already_running

  while true; do
    echo "Checking Umbrel root is bind mounted to external storage..."
    if ! df -h "${UMBREL_ROOT}" | grep --quiet '/dev/sd'; then
      break
    fi

    echo "Checking external storage is mounted..."
    mount_data=$(mount | grep " ${MOUNT_POINT} ")
    if [[ "$(echo ${mount_data} | wc -l)" == "0" ]]; then
      break
    fi

    echo "Checking external storage is not read only..."
    if ! echo "${mount_data}" | awk -F '[()]' '{print $2}' | grep --quiet 'ro'; then
      break
    fi
    
    sleep 1
  done

  echo "Check failed, stopping Umbrel..."
  systemctl stop umbrel-startup
}

main

There's probably a better command to use than mount, ideally that you can pass a mount point to instead of greping through it's output.

Also | grep 'ro' is probably not safe since it would also match stuff in the brackets like errors=remount-ro so you'd need to check for an exact ro match (e.g either side of ro is either a comma or start/end of line).

This commit switches from writing a file to the mount
and monitoring that file to instead checkin the mount
status directly
@vindard
Copy link
Contributor Author

vindard commented Sep 8, 2020

I think it should look roughly something like this: (note this is completely untested sweat_smile)

@lukechilds lol I literally just pushed some new commits that I think are pretty much along these lines 😂

My changes are also untested on an umbrel-os instance. I won't have time to get to testing until tomorrow though (it's GMT-4, 12:15am here), but if anyone gets to it before me feel free to go ahead 🙈

@lukechilds
Copy link
Member

Awesome!

I'm also away atm and don't have access to a physical device to test. Should be back home later tonight though so I'll have a play.

@vindard vindard marked this pull request as ready for review September 8, 2020 04:24
events/triggers/mount-fail Outdated Show resolved Hide resolved

# 'mount-fail' gets triggered if 'check_mount' fails
trigger="mount-fail"
touch "${UMBREL_ROOT}/events/signals/${trigger}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be a good idea to directly do systemctl stop umbrel-startup here since we know at this point the mount has failed, which means we're either writing this signal file to the SD card (which probably won't trigger karen as it was originally watching the signals directory on mounted fs) or we might not be able to touch this signal file at all if the drive is remounted as read-only.

Copy link
Contributor Author

@vindard vindard Sep 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good point! I hadn't thought of all things that would no longer be available if that mount goes down. Making that change back in a bit.

Edit: changed here

The entire umbrel fs (including triggers folder) may
not be available during a mount failure, meaning that
the trigger mechanism would no longer work
@vindard
Copy link
Contributor Author

vindard commented Sep 8, 2020

I left in the monitor-lib separation since I had already implemented that for the monitor_file route. It does make the diff messy though and I'm not sure if that's a structure you guys would want, so if we want we can also back that out and just have the additional things in the new external-storage/monitor script

Copy link
Member

@lukechilds lukechilds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be best to not worry about refactoring the file monitor functionality out into a separate file for now.

Let keep this a super simple change so we can quickly test and review the device monitor functionality without having to worry about testing backup monitor functionality too.

We can do the bash refactor in a separate future PR.

scripts/umbrel-os/external-storage/monitor Outdated Show resolved Hide resolved
scripts/umbrel-os/external-storage/monitor Outdated Show resolved Hide resolved
scripts/umbrel-os/external-storage/monitor Outdated Show resolved Hide resolved
scripts/umbrel-os/external-storage/monitor Outdated Show resolved Hide resolved
scripts/umbrel-os/external-storage/monitor Outdated Show resolved Hide resolved
@vindard
Copy link
Contributor Author

vindard commented Sep 9, 2020

@lukechilds @mayankchhabra hey folks, little update on my testing so far. It looks like I ended up in a little catch-22 when trying to stop the umbrel service. If we call systemctl stop umbrel-startup, this doesn't run because the stop script is no longer available at the mount when the mount errors.

If we remove the bind mount with umount ${UMBREL_ROOT} then we can safely call systemctl stop umbrel-startup because the sd-card version of the umbrel dir becomes available again, but we can't remove that bind mount until all the long running processes spawned from the mount are killed. The external-storage monitor script that we're trying to do all this from is in itself a long running process initially spawned from that bind mount, which is where our catch-22 comes in 🤦‍♂️

Outside of the scripts, if I do systemctl stop umbrel-external-storage I can then umount ${UMBREL_ROOT} the bind mount and safely stop the umbrel service with systemctl stop umbrel-startup, but I currently can't do these steps from inside the monitor script.


Potential solution

The only way I can think of around this is to have some script hosted at a non-umbrel location that doesn't get affected by mounting/unmounting that we can call out to to safely bring everything down. Ideally the monitor would trigger that non-umbrel-hosted script that would then do 3 things:

  • stop umbrel-external-storage to kill long running processes (optional if the new monitor exits properly on its own)
  • unmount the bind mount to get back local access to files
  • stop umbrel-startup (or if this fails call the stop script directly from the SD-card version of things)

How does this sound to you all, and does anyone have any other ideas on how we might get around this?


Edit 1: 2nd longer-term fix for this could be to set things up to only copy data-specific things to the external drive but keep all co-ordination code on the SD card. Bigger thing that can be tackled in later PRs though.

Also randomly related to this, there's this weird edge case where systemd starts with sd-card hosted files and then switches to external drive version of things once the bind mount comes in, and then when you're troubleshooting you can no longer see/edit the files that got bind-mounted over (which def'ly made troubleshooting things trickier 🙈).

@vindard
Copy link
Contributor Author

vindard commented Sep 9, 2020

Let keep this a super simple change so we can quickly test and review the device monitor functionality without having to worry about testing backup monitor functionality too

Ya I think that makes sense. Certainly helps make reviewing things much quicker too!

@lukechilds
Copy link
Member

lukechilds commented Sep 9, 2020

Just something to bear in mind, after some testing and managing to recreate the systemd state of re-mounting the drive in read only mode:

$ ls /dev/sd*
/dev/sdb  /dev/sdb1  /dev/sdc  /dev/sdc1
$ mount | grep MyPassport
/dev/sda1 on /mnt/MyPassport type ext4 (ro,relatime,stripe=256)
$ cat /proc/mounts | grep MyPassport
/dev/sda1 /mnt/MyPassport ext4 ro,relatime,stripe=256 0 0

notice /dev/sda no longer exists, the block device is now at /dev/sdc but mount and /proc/mounts still show /dev/sda. /sys/block/sda also gets moved to /sys/block/sdc.

Checking the sdc block device and partition does not show them as read only:

$ cat /sys/block/sdc/ro
0
$ cat /sys/block/sdc/sdc1/ro
0

So this is not a reliable check. I think this might be to show if the physical device/partition is set as read only, as opposed to if it's mounted as read only.

However the mount output check proposed here #224 (comment) is still reliable.

@lukechilds
Copy link
Member

lukechilds commented Sep 9, 2020

Outside of the scripts, if I do systemctl stop umbrel-external-storage I can then umount ${UMBREL_ROOT} the bind mount and safely stop the umbrel service with systemctl stop umbrel-startup, but I currently can't do these steps from inside the monitor script.

stop umbrel-external-storage to kill long running processes (optional if the new monitor exits properly on its own)

I'm not sure I'm following here, there are no long running processes that will be killed by systemctl stop umbrel-external-storage. It's just a one off script that runs on boot: https://github.com/getumbrel/umbrel/blob/master/scripts/umbrel-os/external-storage/mount

How does this sound to you all, and does anyone have any other ideas on how we might get around this?

I think maybe a better solution might be to completely forget about what is/isn't mounted and just kill stuff with docker directly. We know that will always work, regardless of wether Umbrel/compose config/scripts etc are available.

So monitor mount status, if something goes wrong, reliably kill everything with docker stop $(docker ps -aq).

then when you're troubleshooting you can no longer see/edit the files that got bind-mounted over (which def'ly made troubleshooting things trickier 🙈).

You can view the underlying root filesystem without the $UMBREL_ROOT bind mount at /sd-root

@lukechilds
Copy link
Member

@vindard I have this working and tested based on your branch here: vindard#1

If you can merge that we can go ahead and merge this!

External storage monitor tweaks.
Copy link
Member

@mayankchhabra mayankchhabra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested and everything worked flawlessly! Great job, @vindard and @lukechilds! 🎉

@vindard
Copy link
Contributor Author

vindard commented Sep 9, 2020

I think maybe a better solution might be to completely forget about what is/isn't mounted and just kill stuff with docker directly. We know that will always work, regardless of wether Umbrel/compose config/scripts etc are available.

So monitor mount status, if something goes wrong, reliably kill everything with docker stop $(docker ps -aq).

Ok yea that would do it lol!

I was trying to do the direct docker stop as well, but (like a noob) was doing it as docker-compose down which obviously wouldn't have worked without the docker-compose.yml file mounted. It totally slipped me to go to the docker processes directly 🙈

@vindard
Copy link
Contributor Author

vindard commented Sep 9, 2020

You can view the underlying root filesystem without the $UMBREL_ROOT bind mount at /sd-root

You know I actually saw this bind mount happening as well when I was first reviewing the code and was wondering why you guys were doing that. Makes sense now!

@lukechilds lukechilds merged commit af4066d into getumbrel:master Sep 9, 2020
@lukechilds
Copy link
Member

Merged, thanks for helping out with this @vindard!

@vindard vindard deleted the issue-217/monitor-mount branch September 9, 2020 14:04
@lukechilds lukechilds mentioned this pull request Sep 9, 2020
lukechilds added a commit that referenced this pull request Sep 9, 2020
This update brings some minor bugfixes and adds monitoring functionality to ensure your external storage device is running reliably.

Changes:

Stop Umbrel if external storage device is unreliable (#224) af4066d (Thanks @vindard for helping out with this)
Prevent start script from failing on no hidden service file (#226) 95a0dd8
Bump dashboard to v0.3.7 (#227) eea7208
Diff: v0.2.9...v0.2.10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stop Umbrel if external storage is unmounted on Umbrel OS
4 participants