Failed to mount sysroot on reboot for nodes with a 'large' disk #2485

Closed
basvdlei opened this Issue Jul 31, 2018 · 14 comments

basvdlei commented Jul 31, 2018

Issue Report

Bug

Container Linux Version

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1800.5.0
VERSION_ID=1800.5.0
BUILD_ID=2018-07-28-2250
PRETTY_NAME="Container Linux by CoreOS 1800.5.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

VMware ESXi

Expected Behavior

When rebooting a node with a "large" disk it should be able to mount sysroot.

Actual Behavior

A node with a "large" disk fails to mount sysroot when it's rebooted:

systemd[1]: Mounting /sysroot...
EXT4-fs (sda9): ext4_check_descriptors: Block bitmap for group 0 overlaps block group descriptors
EXT4-fs (sda9): group descriptors corrupted!
mount[419]: mount: /sysroot: mount(2) system call failed: Structure needs cleaning.
systemd[1]: sysroot.mount: Mount process exited, code=exited status=32
Failed to mount /sysroot.
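
To check other nodes for the same failure, the kernel message above can be matched in a saved boot log (for example an rdsosreport.txt). This is a minimal sketch: the function name `find_descriptor_overlap` is hypothetical, and the pattern is taken only from the log lines quoted in this report, so other kernel versions may word the message differently.

```python
import re

# Pattern built from the exact kernel message quoted in this issue.
SIGNATURE = re.compile(
    r"EXT4-fs \((?P<dev>[^)]+)\): ext4_check_descriptors: "
    r"Block bitmap for group (?P<group>\d+) overlaps block group descriptors"
)


def find_descriptor_overlap(log_text):
    """Return (device, group) pairs for every matching line in log_text."""
    return [(m["dev"], int(m["group"])) for m in SIGNATURE.finditer(log_text)]


sample = (
    "systemd[1]: Mounting /sysroot...\n"
    "EXT4-fs (sda9): ext4_check_descriptors: Block bitmap for group 0 "
    "overlaps block group descriptors\n"
    "EXT4-fs (sda9): group descriptors corrupted!\n"
)
print(find_descriptor_overlap(sample))
```

An empty list means the signature was not found in the text that was passed in; it does not prove the filesystem is healthy.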

[Screenshot: sysroot-mount-failed]

Reproduction Steps

  1. Create a CoreOS node with a 3.91TB disk (have not been able to test other sizes yet)
  2. Root filesystem is resized and mounted correctly on the first boot
  3. Reboot the machine
  4. Mount of /sysroot fails during the boot

Other Information

We first observed this issue when a machine with a 3.91TB disk failed to update from 1745.7.0 to 1800.4.0. Version 1745.7.0 was still able to mount the filesystem, while 1800.4.0 gave the error described above.

It looks like a regression was introduced in kernel 4.14.55 with the ext4 changes (https://lwn.net/Articles/759535/), and from what we could gather this may be the responsible patch: https://patchwork.ozlabs.org/patch/950668/

All of our machines with smaller disks (<500GB) still boot and reboot correctly.

adarshaj commented Jul 31, 2018

We are affected by this too. Our platform is bare metal with a 4TB disk.
I initially suspected it had something to do with rook's cephfs, since we used one of the directories on the root partition for an OSD. We ran fsck (with the latest e2fsprogs release, built on Jul 10 2018), but fsck reports the disk as okay. I am attaching the rdsosreport.txt in case it's helpful for triaging the issue.

FWIW, I did a fresh installation after running sgdisk -z /dev/sda (zap GPT), and it failed with the exact same issue, so I'm pretty sure it's something in the new kernel, not the hard disk (smartctl also reports the disk as healthy). After running mkfs.ext4 -S /dev/sda9 (WARNING: lost all data), the above error stopped occurring, but the root partition was completely erased (after running fsck), with a large amount of invalid inode metadata.

I tried with 1800.5.0 too, but the same issue persists.

adarshaj commented Jul 31, 2018

This seems to be the fix - torvalds/linux@5012284 (read the commit msg for details and linked culprit commit at torvalds/linux@8844618 -- which is exactly the behavior reported above in the logs). How can we test this kernel?

Member

dm0- commented Jul 31, 2018

I've cherry-picked the upcoming ext4 fixes (including the commit you linked) onto the current stable and produced a test image here: http://builds.developer.core-os.net/boards/amd64-usr/1800.5.0%2Bjenkins2-build-1800%2Blocal-1683/coreos_production_image.bin.bz2

Can you confirm that resolves the issue?

adarshaj commented Jul 31, 2018

Is there a way to test this without nuking the ROOT labelled partition? (for context, I'm running a node in tectonic cluster with cluo managing upgrades)

Member

dm0- commented Jul 31, 2018

If the failure is reproducible by just mounting the root partition, you could try booting the ISO or PXE version and mounting the disk manually. That way, nothing will be overwritten on persistent storage.

Author

basvdlei commented Aug 1, 2018

Even with the test image, I'm still able to reproduce this issue.

I took both a 1800.5.0 image (https://stable.release.core-os.net/amd64-usr/1800.5.0/coreos_production_image.bin.bz2) and the test image of @dm0- above and ran through the following scenario.

  • Convert and resize the raw image to a 4TB qcow2 image
qemu-img convert -p -O qcow2 coreos_production_image.bin coreos_production_image.qcow2
qemu-img resize coreos_production_image.qcow2 4T
  • Create and boot a KVM VM using this image.
  • Let it boot to the login prompt.
  • Trigger a reboot.

[Screenshot: kvm-test]

Author

basvdlei commented Aug 1, 2018

Did a couple more tests with different disk sizes (1TB -> 2TB -> 3TB). The 1TB and 2TB cases worked fine.

The 3TB drive case failed. I also noticed that it displayed an additional log line during the resizing:

EXT4-fs (sda9): Converting file system to meta_bg
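
For a rough sense of scale at these disk sizes, the arithmetic below computes how many block groups and group descriptor blocks an ext4 filesystem needs. It assumes 4 KiB blocks, 32768 blocks per group, and 32-byte descriptors (64-byte with the 64bit feature); these are common mkfs defaults but were not confirmed for these nodes. It also treats the whole disk as the filesystem and does not model the reserved descriptor space that decides when the kernel converts to meta_bg, so it illustrates growth only, not the exact threshold.

```python
BLOCK_SIZE = 4096                   # assumed ext4 block size in bytes
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE   # one bitmap block covers 32768 blocks
TIB = 1024 ** 4


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division without floating point."""
    return -(-a // b)


def block_groups(fs_bytes: int) -> int:
    """Number of block groups needed to cover fs_bytes."""
    return ceil_div(fs_bytes // BLOCK_SIZE, BLOCKS_PER_GROUP)


def descriptor_blocks(fs_bytes: int, desc_size: int = 32) -> int:
    """Number of blocks needed to hold one descriptor per block group."""
    descriptors_per_block = BLOCK_SIZE // desc_size
    return ceil_div(block_groups(fs_bytes), descriptors_per_block)


for size in (1, 2, 3, 4):
    groups = block_groups(size * TIB)
    gdt = descriptor_blocks(size * TIB)
    print(f"{size} TiB: {groups} block groups, {gdt} descriptor blocks")
```

Under these assumptions the descriptor table grows linearly with disk size, which is consistent with the failure appearing only above some size, although the thread does not establish the exact cutoff.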

Just to make sure, was this commit included in the test image? torvalds/linux@44de022

Member

bgilbert commented Aug 2, 2018

torvalds/linux@44de022 is not currently in the stable queue for kernels older than 4.17 because of a trivial patch conflict (see e.g. 4.14). I've reproduced the issue on 1800.5.0, and confirmed that the combination of torvalds/linux@44de022 and the other ext4 changes queued for 4.14 fixes the problem.

Member

bgilbert commented Aug 2, 2018

Backport posted to stable@.

adarshaj commented Aug 2, 2018

So we should wait until a new point release reaches the stable channel here: https://coreos.com/releases/ before upgrading, right?

Member

bgilbert commented Aug 2, 2018

@adarshaj Correct. We'll backport the fix to the existing release branches.

Author

basvdlei commented Aug 8, 2018

Thanks! Release 1800.6.0 solves this issue for us. We successfully updated our 4TB nodes.

adarshaj commented Aug 9, 2018

I can confirm too: all the bare metal instances with 4TB disks have upgraded successfully, and the hard disks are being mounted without any issues. Thanks!

I guess this issue can be closed now.

Member

bgilbert commented Aug 13, 2018

Fixed in alpha 1855.1.0, beta 1828.3.0, and stable 1800.6.0, and upstream in kernel 4.14.62. Thanks for reporting.
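
A fleet inventory can be checked against these release numbers with a small comparison. The sketch below is illustrative only: the names `FIXED_IN` and `has_ext4_fix` are hypothetical, the channel has to be supplied by the caller, and it assumes that every later release on a channel keeps the fix.

```python
from typing import Tuple

# First fixed release per channel, as stated in the comment above.
FIXED_IN = {
    "alpha": (1855, 1, 0),
    "beta": (1828, 3, 0),
    "stable": (1800, 6, 0),
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a VERSION_ID such as '1800.5.0' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def has_ext4_fix(channel: str, version: str) -> bool:
    """True if the release is at or past the first fixed one for its channel."""
    return parse_version(version) >= FIXED_IN[channel]


# 1800.5.0 is the stable release from the original report.
print(has_ext4_fix("stable", "1800.5.0"))
print(has_ext4_fix("stable", "1800.6.0"))
```

The version string has the same form as the VERSION_ID field shown in the os-release output at the top of this issue.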

bgilbert closed this Aug 13, 2018
