Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add FS corruption checks #45

Open
wants to merge 1 commit into
base: master
from
Open

Add FS corruption checks #45

wants to merge 1 commit into from

Conversation

@lekkas
Copy link
Contributor

lekkas commented Aug 12, 2018

Fixes #44

We don't need to specify docker tags, because as 'docker save' docs mention:

Produces a tarred repository to the standard output stream. Contains all parent layers, and all tags + versions, or specified repo:tag, for each argument provided.

Example run

--- CHECKS ---

MEM: OK (24% available.)
DOCKER: OK (docker is running.)
SUPERVISOR: OK (supervisor is running).
DNS: OK (first DNS server is 127.0.0.2.)
DISK: DANGER: BTRFS filesystem not mounted at /var/lib/balena!
METADATA: SKIP: not resinOS 1.x device
FILESYSTEM: Checking docker images for signs of corruption
  - registry2.resin.io/v2/7568b10a00974ac3158e91b6897310d5
  - resin/armv7hf-supervisor
FILESYSTEM: OK (No signs of docker image corruption)

change-type: minor

@lekkas lekkas requested review from imrehg and thgreasi Aug 12, 2018
@lekkas lekkas force-pushed the add_fs_checks branch from 17063a9 to 81bb702 Aug 12, 2018
@resin-io-versionbot

This comment has been minimized.

Copy link

resin-io-versionbot bot commented Aug 12, 2018

@lekkas, status checks have failed for this PR. Please make appropriate changes and recommit.

for i in "${images[@]}"
do
echo " - $i"
$docker_name save --output /dev/null "$i"

This comment has been minimized.

Copy link
@imrehg

imrehg Aug 13, 2018

Contributor

@lekkas this form with --output does save the image to a temp file, and thus memory limit might fail the check, even if it would otherwise work. E.g. on my test device:

root@246d1c3:~# balena save --output /dev/null d491939c6883
write /dev/.docker_temp_068568365: no space left on device

On the other hand, $docker_name save "$i" > /dev/null does not save to an intermediate file, so it is not limited by space/memory, should be the right way.

{
echo "FILESYSTEM: Checking docker images for signs of corruption"

local images=($($docker_name images | awk '{print $1}' | grep -v '<none>\|REPOSITORY'))

This comment has been minimized.

Copy link
@imrehg

imrehg Aug 13, 2018

Contributor

Probably would run the test by the image ID, as there are some versions of resinOS might have multiple tags pointing to the same ID, and then the test might run multiple times. It's a bit less informative, but together with a $docker_name images later, it should be fine to run it like $docker_name images -q | uniq, IMHO.

echo " - $i"
$docker_name save --output /dev/null "$i"
if [ $? != 0 ]; then
echo "FILESYSTEM: DANGER: Image "$i" appears corrupted"

This comment has been minimized.

Copy link
@petrosagg

petrosagg Aug 21, 2018

Member

This will always happen if deltas v2 are enabled, we should check if that's the case before reporting a failure

@shaunmulligan

This comment has been minimized.

Copy link

shaunmulligan commented Nov 20, 2018

@lekkas we should pick this up again, its a good check to have.

@xginn8

This comment has been minimized.

Copy link
Collaborator

xginn8 commented Sep 2, 2019

i'm taking over this PR as lekkas is gone

@xginn8 xginn8 force-pushed the add_fs_checks branch from 81bb702 to 4781867 Sep 2, 2019
@lekkas lekkas requested a review from xginn8 as a code owner Sep 2, 2019
@xginn8 xginn8 force-pushed the add_fs_checks branch from 4781867 to b9ab471 Feb 5, 2020
@lekkas lekkas requested review from wrboyce and ZubairLK as code owners Feb 5, 2020
@xginn8 xginn8 force-pushed the add_fs_checks branch from b9ab471 to ff186cb Mar 13, 2020
Fixes: #44
Change-type: minor
Signed-off-by: Matthew McGinn <matthew@balena.io>
@xginn8 xginn8 force-pushed the add_fs_checks branch from ff186cb to 7298a3f Mar 13, 2020
local corrupted=()
local -i corrupted_count=0
# TODO: this command is filtering out any images with a size <1Kb (for delta-based images)
images=$(${TIMEOUT_CMD} ${ENG} image ls --format "{{.ID}} {{.Size}}" | awk '/![1-9]*B/{next;} {print $1}' | sort -u)

This comment has been minimized.

Copy link
@xginn8

xginn8 Mar 13, 2020

Collaborator

I couldn't figure out a nice way to dehumanize these values from the engine directly.

This comment has been minimized.

Copy link
@robertgzr

robertgzr Mar 23, 2020

Member

fwiw delta images have the labels io.resin.delta.config and io.resin.delta.base and you might be able to filter for them with the --filter flag

cc @dfunckt I think this might only be valid for the engine-based delta's though

This comment has been minimized.

Copy link
@dfunckt

dfunckt Mar 23, 2020

Member

Yes @robertgzr, that is correct.

@xginn8

This comment has been minimized.

Copy link
Collaborator

xginn8 commented Mar 13, 2020

@thgreasi I know you're excited for this PR, mind taking a look?

@thgreasi

This comment has been minimized.

Copy link
Member

thgreasi commented Mar 13, 2020

@xginn8 Will try, even though I'm not a bash guy.
Also, is there any scratchpad entry (the JF one) that this is based on?

@thgreasi

This comment has been minimized.

Copy link
Member

thgreasi commented Mar 13, 2020

Given that the title of this PR is "FS corruption checks", let me also drop here the grep that I'm often using to check the journal:

journalctl -n 10000 | grep "corrupt\|Data will be lost\|block allocation failed\|invalid magic\|dockerd: error -117"

That's purely based on what I've seen so far on support and might need fixes.

@xginn8

This comment has been minimized.

Copy link
Collaborator

xginn8 commented Mar 13, 2020

@xginn8 Will try, even though I'm not a bash guy.
Also, is there any scratchpad entry (the JF one) that this is based on?

There is https://jel.ly.fish/cc663f1d-883b-49aa-bf4d-8d2dac623724

@xginn8

This comment has been minimized.

Copy link
Collaborator

xginn8 commented Mar 13, 2020

Given that the title of this PR is "FS corruption checks", let me also drop here the grep that I'm often using to check the journal:

journalctl -n 10000 | grep "corrupt\|Data will be lost\|block allocation failed\|invalid magic\|dockerd: error -117"

That's purely based on what I've seen so far on support and might need fixes.

This is also quite useful, and perhaps can be worked into #178?

@thgreasi

This comment has been minimized.

Copy link
Member

thgreasi commented Mar 15, 2020

This is also quite useful, and perhaps can be worked into #178?

Sounds great ❤️

@xginn8 xginn8 requested a review from shaunmulligan Mar 20, 2020
local corrupted=()
local -i corrupted_count=0
# TODO: this command is filtering out any images with a size <1Kb (for delta-based images)
images=$(${TIMEOUT_CMD} ${ENG} image ls --format "{{.ID}} {{.Size}}" | awk '/![1-9]*B/{next;} {print $1}' | sort -u)

This comment has been minimized.

Copy link
@robertgzr

robertgzr Mar 23, 2020

Member

fwiw delta images have the labels io.resin.delta.config and io.resin.delta.base and you might be able to filter for them with the --filter flag

cc @dfunckt I think this might only be valid for the engine-based delta's though

do
# TODO: the timeout here is probably too short to be effective in most cases, but let's gather
# data and change if necessary
if ! ${TIMEOUT_CMD} ${ENG} save "${i}" > /dev/null ; then

This comment has been minimized.

Copy link
@robertgzr

robertgzr Mar 23, 2020

Member

is this check not also sensitive issues with disk IO?

This comment has been minimized.

Copy link
@xginn8

xginn8 Mar 24, 2020

Collaborator

It is definitely sensitive. As a precaution I've timed out the engine call normally, but I suspect that might be not long enough (cause too many timeouts instead of real results). I thought the timeout data was collected and returned separate from the failure to save, but I shall fix that.

images=$(${TIMEOUT_CMD} ${ENG} image ls --format "{{.ID}} {{.Size}}" | awk '/![1-9]*B/{next;} {print $1}' | sort -u)

if (( ${#images[@]} > 0 )); then
for i in ${images};

This comment has been minimized.

Copy link
@thgreasi

thgreasi Mar 26, 2020

Member

NIT: How about img instead of just i?

local corrupted=()
local -i corrupted_count=0
# TODO: this command is filtering out any images with a size <1Kb (for delta-based images)
images=$(${TIMEOUT_CMD} ${ENG} image ls --format "{{.ID}} {{.Size}}" | awk '/![1-9]*B/{next;} {print $1}' | sort -u)

This comment has been minimized.

Copy link
@thgreasi

thgreasi Mar 26, 2020

Member

My awk-fu is bad. Can I get some pointers?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

8 participants
You can’t perform that action at this time.