
CoreOS 1855.4.0 AWS EBS Mount Lockup #2511

Open
pctj101 opened this Issue Oct 17, 2018 · 16 comments

@pctj101

pctj101 commented Oct 17, 2018

Issue Report

Bug

Ignition crashes the system if storage.filesystems is specified

CT Input

storage:
  filesystems:
    - name: data
      mount:
        device: /dev/sdb
        format: ext4
        wipe_filesystem: true
        label: DATA

Convert to userdata
ct < test.ct

{"ignition":{"config":{},"security":{"tls":{}},"timeouts":{},"version":"2.2.0"},"networkd":{},"passwd":{},"storage":{"filesystems":[{"mount":{"device":"/dev/sdb","format":"ext4","label":"DATA","wipeFilesystem":true},"name":"data"}]},"systemd":{}}

Container Linux Version

CoreOS-stable-1855.4.0-hvm (ami-086eb64b7f4485a72)
ct v0.9.0

Environment

AWS

Expected Behavior

At minimum format my block device

Actual Behavior

System does not boot
Can't log in, so can't get logs
Screenshot
https://www.evernote.com/l/AE__MLODCjJN_p8vv9G_LkqC2nBnb6BbAqI

Reproduction Steps

  1. Create EC2 instance, attach 80GB EBS to /dev/sdb, add user data, boot and crash

Other Information

Worked before on older CoreOS-stable-1688.5.3-hvm (ami-a2b6a2de)

Manually booting without CT/Ignition allows manual format/mounting of /dev/sdb (mounting by label is also no problem)

@pctj101 pctj101 changed the title from CoreOS 1855.4.0 AWS EBS to CoreOS 1855.4.0 AWS EBS Mount Lockup Oct 17, 2018

@ajeddeloh

ajeddeloh commented Oct 17, 2018

Thanks for the report. This probably isn't an Ignition bug but rather a kernel bug since Ignition didn't change between 1855.3.0 and 1855.4.0. Can you repro on alpha?

@pctj101

pctj101 commented Oct 17, 2018

Will check tomorrow :)

@pctj101

pctj101 commented Oct 18, 2018

@ajeddeloh - Same issue on alpha:
CoreOS-alpha-1925.0.0-hvm (ami-01d20d68c856200cc)

Also please note previous working version was much older:
CoreOS-stable-1688.5.3-hvm (ami-a2b6a2de)

@enieuw

enieuw commented Oct 19, 2018

This happens for me as well on the latest-generation instances. Switching an instance from t2 to t3 results in the system hanging on a systemd unit that's waiting for the device /dev/xvdg.

Perhaps this has something to do with the switch to the NVMe device names that t3 instances use.

@enieuw

enieuw commented Oct 19, 2018

I fetched one of the logs from the machines; I see lots of these messages:

[*     ] (1 of 3) A start job is running for dev-xvdg.device (4s / 1min 30s)
[**    ] (1 of 3) A start job is running for dev-xvdg.device (5s / 1min 30s)
[***   ] (1 of 3) A start job is running for dev-xvdg.device (5s / 1min 30s)
[ ***  ] (2 of 3) A start job is running for Ignition (disks) (10s / no limit)
[  *** ] (2 of 3) A start job is running for Ignition (disks) (11s / no limit)
[   ***] (2 of 3) A start job is running for Ignition (disks) (11s / no limit)
[    **] (3 of 3) A start job is running for…mapper-usr.device (12s / no limit)
[     *] (3 of 3) A start job is running for…mapper-usr.device (12s / no limit)
[    **] (3 of 3) A start job is running for…mapper-usr.device (13s / no limit)
[   ***] (1 of 3) A start job is running for dev-xvdg.device (9s / 1min 30s)
[  *** ] (1 of 3) A start job is running for dev-xvdg.device (9s / 1min 30s)
[ ***  ] (1 of 3) A start job is running for dev-xvdg.device (10s / 1min 30s)
[***   ] (2 of 3) A start job is running for Ignition (disks) (15s / no limit)
[**    ] (2 of 3) A start job is running for Ignition (disks) (15s / no limit)
[*     ] (2 of 3) A start job is running for Ignition (disks) (16s / no limit)
[**    ] (3 of 3) A start job is running for…mapper-usr.device (16s / no limit)
[***   ] (3 of 3) A start job is running for…mapper-usr.device (17s / no limit)
[   24.010121] systemd-networkd[242]: eth0: Configured
[ ***  ] (3 of 3) A start job is running for…mapper-usr.device (17s / no limit)
[  *** ] (1 of 3) A start job is running for dev-xvdg.device (13s / 1min 30s)
[   ***] (1 of 3) A start job is running for dev-xvdg.device (14s / 1min 30s)
[    **] (1 of 3) A start job is running for dev-xvdg.device (14s / 1min 30s)
[     *] (2 of 3) A start job is running for Ignition (disks) (19s / no limit)

It eventually times out:

[  101.111108] systemd[1]: Timed out waiting for device dev-xvdg.device.
[FAILED] Failed to start Ignition (disks).
See 'systemctl status ignition-disks.service' for details.
[  101.154243] ignition[415]: disks: createFilesystems: op(1): [failed]   waiting for devices [/dev/xvdg]: device unit dev-xvdg.device timeout
[  101.159042] systemd[1]: dev-xvdg.device: Job dev-xvdg.device/start failed with result 'timeout'.

@pctj101

pctj101 commented Oct 19, 2018

@enieuw Hey... how did you get that system log? My AWS EC2 system logs are blank :(

@enieuw

enieuw commented Oct 19, 2018

> @enieuw Hey... how did you get that system log? My AWS EC2 system logs are blank :(

It takes a while but eventually they show up under "Instance Settings -> Get system log".

Which instance type are you running by the way?

@pctj101

pctj101 commented Oct 19, 2018

For this debug session I was running t2/t3/m5 (can't remember the exact size)

@enieuw

enieuw commented Oct 19, 2018

Creating a VM as a t2 instance and then changing the instance type to t3 works; I can actually see the symlinks working:

Container Linux by CoreOS stable (1855.4.0)
core@ip-10-14-30-4 ~ $ systemctl status dev-xvdg.device
● dev-xvdg.device - Amazon Elastic Block Store
   Follow: unit currently follows state of sys-devices-pci0000:00-0000:00:1f.0-nvme-nvme1-nvme1n1.device
   Loaded: loaded
   Active: active (plugged) since Fri 2018-10-19 06:49:00 UTC; 1min 8s ago
   Device: /sys/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1

Oct 19 06:49:00 ip-10-14-30-4 systemd[1]: Found device Amazon Elastic Block Store.
core@ip-10-14-30-4 ~ $ date
Fri Oct 19 06:50:14 UTC 2018
core@ip-10-14-30-4 ~ $ ls -al /dev/xvdg
lrwxrwxrwx. 1 root root 7 Oct 19 06:49 /dev/xvdg -> nvme1n1
core@ip-10-14-30-4 ~ $

Creating a fresh t3 instance results in the hanging behaviour.

@pctj101

pctj101 commented Oct 19, 2018

Ah, I'm betting that's because Ignition doesn't run on the second boot after the instance type change.

@enieuw

enieuw commented Oct 19, 2018

Yeah, most likely. It doesn't trigger the wait for the systemd unit, so booting continues.

If I specify /dev/nvme1n1 in my Ignition file, it does boot properly. Perhaps the call to systemd is made before udev has mapped the aliases added by #2399
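
For anyone else hitting this on NVMe instance types, the workaround above would look like this as a CT config (a sketch based on the config from the original report, with only the device path changed; note that /dev/nvme1n1 depends on attach order, so it is illustrative only):

```yaml
storage:
  filesystems:
    - name: data
      mount:
        device: /dev/nvme1n1   # NVMe name as exposed on t3/m5, not the console name
        format: ext4
        wipe_filesystem: true
        label: DATA
```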

@lucab

Member

lucab commented Oct 19, 2018

@enieuw I think you are waiting for coreos/bootengine#149 to do that.

@pctj101

pctj101 commented Oct 22, 2018

Okay, it looks like there's a mismatch between assigning the EBS volume to /dev/sdb in the AWS console and /dev/xvdb appearing in Linux.

ap-northeast-1
t2.micro
CoreOS-stable-1855.4.0-hvm (ami-086eb64b7f4485a72)
Root device /dev/xvda
Block devices /dev/xvda /dev/sdb

{"ignition":{"config":{},"security":{"tls":{}},"timeouts":{},"version":"2.2.0"},"networkd":{},"passwd":{},"storage":{"filesystems":[{"mount":{"device":"/dev/sdb","format":"ext4","label":"DATA","wipeFilesystem":true},"name":"data"}]},"systemd":{}}

Results in:

disks: createFilesystems: op(1): [started]  waiting for devices [/dev/sdb]
disks: createFilesystems: op(1): [failed]   waiting for devices [/dev/sdb]: device unit dev-sdb.device timeout
disks: failed to create filesystems: failed to wait on filesystems devs: device unit dev-sdb.device timeout

Updating the config from sdb to xvdb lets the boot finish.
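
For reference, the working variant of the original CT config would look like this (same config as in the report, with only the device path changed to the name this instance type actually exposes):

```yaml
storage:
  filesystems:
    - name: data
      mount:
        device: /dev/xvdb   # Xen name as seen by the kernel, not /dev/sdb from the console
        format: ext4
        wipe_filesystem: true
        label: DATA
```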

Is there already a ticket for sdb vs xvdb? I think on some systems (can't remember) /dev/sdb shows up instead.

@pctj101

pctj101 commented Oct 23, 2018

As a follow-on thought: it seems EBS volumes (add-on disks on AWS) sometimes show up as /dev/sdb and sometimes as /dev/xvdb. That makes Ignition configs fail when the names don't match, and makes it difficult to use the same config on various servers.

Is there any guidance on /dev/sdb vs /dev/xvdb going forward in CoreOS? Perhaps following such guidance would have prevented this ticket.

@lucab

Member

lucab commented Oct 23, 2018

@pctj101 this is an unfortunate choice on AWS side, see #2399 (comment). Their volumes/instances/names grid is documented here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/device_naming.html

@pctj101

pctj101 commented Oct 23, 2018

@lucab - Yes, I've also seen my EC2 launch spec and the CoreOS device path mismatch. I think it's related to this item on the same page you linked:

Depending on the block device driver of the kernel, the device could be attached with a different name than you specified. For example, if you specify a device name of /dev/sdh, your device could be renamed /dev/xvdh or /dev/hdh.

So it seems the kernel configuration (and thus CoreOS) also plays a part. It's not just "it's AWS" but "it's AWS and how CoreOS interacts with it", which is why I'm bringing up this question. :)

Anyway, yes, I read the other thread you linked to. For device mapping I've abandoned Ignition and resorted to a series of shell scripts to format and mount things properly (surviving instance type changes). I'm not sure that's the long-term way to do it, but I'm pretty sure the discussion either way is lengthy and has plenty of ideology to go with it :)
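
A minimal sketch of the kind of thing such a script can do: probe the candidate names the kernel might have exposed for the same EBS volume and use whichever exists. The function name, candidate paths, and the DATA label here are illustrative assumptions, not official guidance:

```shell
#!/bin/sh
# Hypothetical sketch: the same EBS volume may appear as /dev/sdb,
# /dev/xvdb, or /dev/nvme1n1 depending on instance type. Return the
# first candidate path that actually exists on this machine.
resolve_data_device() {
    for dev in "$@"; do
        # -e matches block device nodes as well as udev symlinks
        if [ -e "$dev" ]; then
            printf '%s\n' "$dev"
            return 0
        fi
    done
    return 1  # none of the candidate names exist
}

# Usage (illustrative, destructive if uncommented): try the console
# name first, then the Xen and NVMe names.
# DEV=$(resolve_data_device /dev/sdb /dev/xvdb /dev/nvme1n1) || exit 1
# blkid "$DEV" | grep -q 'LABEL="DATA"' || mkfs.ext4 -L DATA "$DEV"
# mount "$DEV" /mnt/data
```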

When it comes to AWS completely changing device paths for NVMe, even I have trouble justifying automagic resolution in Ignition.

It's definitely a usability discussion rather than a bug discussion.
