coreos AMI CoreOS-stable-1745.7.0-hvm fails to start on i3.metal instances #2464

Open
bajacondor opened this Issue Jun 19, 2018 · 12 comments


bajacondor commented Jun 19, 2018

Issue Report

Bug

Container Linux Version

 CoreOS-stable-1745.7.0-hvm

Environment

AWS EC2 instance i3.metal

Expected Behavior

Instance starts, status checks pass, and SSH access is available.

Actual Behavior

Instance shows running, status checks fail, and SSH access is not available.

Reproduction Steps

Launch an AWS EC2 i3.metal instance with the CoreOS-stable-1745.7.0-hvm (ami-662f6d1e) AMI.

Other Information

System logs from the instance settings show:

[   21.830940] ehci-pci 0000:00:1d.0: EHCI
[   36.425579] systemd-networkd[740]: eth0: Configured
[   52.191081] nvme nvme0: I/O 23 QID 1 timeout, aborting
[   52.191279] nvme nvme0: I/O 24 QID 1 timeout, aborting
[   52.191449] nvme nvme0: I/O 25 QID 1 timeout, aborting
[   52.191617] nvme nvme0: I/O 26 QID 1 timeout, aborting

This indicates an NVMe driver issue.
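For anyone checking whether their own instance shows the same symptom, here is a minimal sketch that scans a saved console log for the NVMe timeout lines quoted above. The function name `find_nvme_timeouts` and the `SAMPLE` text are illustrative assumptions; the pattern is derived only from the log lines in this report and may not match other kernel versions' wording.

```python
import re

# Matches lines shaped like the ones in this report, e.g.
# "[   52.191081] nvme nvme0: I/O 23 QID 1 timeout, aborting"
NVME_TIMEOUT = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvme (?P<dev>nvme\d+): "
    r"I/O (?P<tag>\d+) QID (?P<qid>\d+) timeout, aborting"
)


def find_nvme_timeouts(console_log: str):
    """Return (timestamp, device, io_tag, queue_id) for each timeout line."""
    return [
        (float(m["ts"]), m["dev"], int(m["tag"]), int(m["qid"]))
        for m in NVME_TIMEOUT.finditer(console_log)
    ]


# Hypothetical sample input copied from the log excerpt in this issue.
SAMPLE = """\
[   36.425579] systemd-networkd[740]: eth0: Configured
[   52.191081] nvme nvme0: I/O 23 QID 1 timeout, aborting
[   52.191279] nvme nvme0: I/O 24 QID 1 timeout, aborting
"""

if __name__ == "__main__":
    for hit in find_nvme_timeouts(SAMPLE):
        print(hit)
```

A non-empty result only shows that the kernel logged I/O timeouts on an NVMe device; it does not by itself identify the cause.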

Member

bgilbert commented Jun 19, 2018

Was this working for you on older Container Linux versions?

Alalk commented Jun 19, 2018

This bug is not limited to CoreOS-stable-1745.7.0-hvm. We saw the same issue with Beta 1772.4.0 and Alpha 1800.1.0.

bajacondor commented Jun 19, 2018

@bgilbert, no, we have only just started working on running CoreOS on i3.metal instances. We have tried a number of different versions, including the alpha and beta channels that @Alalk mentions above (@Alalk is on my team). I recently tried stable-14xx and stable-15xx versions as well. All attempts have reproduced this same behavior.

Member

lucab commented Jun 20, 2018

I think this may be closely related to #2371 (comment).

Alalk referenced this issue Aug 10, 2018

Closed

Update grub.cfg #834

Alalk added a commit to Alalk/coreos-overlay that referenced this issue Aug 10, 2018

nvme timeout issue.
Update the GRUB config to use the NVMe defaults required by AWS. This should (eventually) resolve the failure to pass status checks.
This only works on kernel 4.15 and later (below 4.15, the maximum core timeout is 255).

coreos/bugs#2464
coreos/bugs#2484
coreos/bugs#2371
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html#timeout-nvme-ebs-volumes

Alalk added a commit to Alalk/coreos-overlay that referenced this issue Aug 10, 2018

nvme timeout issue patch for kernels below 4.15
Update the GRUB config to use the NVMe defaults required by AWS. This should (eventually) resolve the failure to pass status checks.
This only works on kernel 4.15 and later (below 4.15, the maximum core timeout is 255).

coreos/bugs#2464
coreos/bugs#2484
coreos/bugs#2371
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html#timeout-nvme-ebs-volumes
05c1f12

Alalk added a commit to Alalk/coreos-overlay that referenced this issue Aug 10, 2018

nvme timeout issue
Update the GRUB config to use the NVMe defaults required by AWS. This should (eventually) resolve the failure to pass status checks.
This only works on kernel 4.15 and later (below 4.15, the maximum core timeout is 255).

coreos/bugs#2464
coreos/bugs#2484
coreos/bugs#2371
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html#timeout-nvme-ebs-volumes
05c1f12

Alalk added a commit to Alalk/coreos-overlay that referenced this issue Aug 10, 2018

nvme timeout issue patch
Update the GRUB config to use the NVMe defaults required by AWS. This should (eventually) resolve the failure to pass status checks.
This only works on kernel 4.15 and later (below 4.15, the maximum core timeout is 255).

coreos/bugs#2464
coreos/bugs#2484
coreos/bugs#2371
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html#timeout-nvme-ebs-volumes

Alalk added a commit to Alalk/coreos-overlay that referenced this issue Aug 10, 2018

Update grub-ec2.cfg
Update the GRUB config to use the NVMe defaults required by AWS. This should (eventually) resolve the failure to pass status checks.
This only works on kernel 4.15 and later (below 4.15, the maximum core timeout is 255).

coreos/bugs#2464
coreos/bugs#2484
coreos/bugs#2371
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html#timeout-nvme-ebs-volumes
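
The kernel-version constraint mentioned in these commit messages can be sketched as follows. This is an illustration under stated assumptions, not the contents of the referenced commits, which are not shown in this thread: it assumes the timeout in question is the `nvme_core.io_timeout` kernel parameter described in the linked AWS documentation, with a maximum of 4294967295 on kernel 4.15 and later and 255 on earlier kernels. The helper name `nvme_timeout_arg` is hypothetical.

```python
import re

# Maximum values as described in the linked AWS NVMe EBS documentation
# (assumption: the commits refer to the nvme_core.io_timeout parameter).
MAX_TIMEOUT_MODERN = 4294967295  # kernel 4.15 and later
MAX_TIMEOUT_LEGACY = 255         # kernels before 4.15


def nvme_timeout_arg(kernel_release: str) -> str:
    """Build the kernel command-line argument for a given kernel release.

    kernel_release is a string such as the output of `uname -r`.
    Very old kernels may name the parameter differently; that case is
    not handled in this sketch.
    """
    m = re.match(r"(\d+)\.(\d+)", kernel_release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {kernel_release!r}")
    version = (int(m.group(1)), int(m.group(2)))
    value = MAX_TIMEOUT_MODERN if version >= (4, 15) else MAX_TIMEOUT_LEGACY
    return f"nvme_core.io_timeout={value}"


if __name__ == "__main__":
    # Hypothetical release strings for illustration.
    for release in ("4.14.48-coreos-r2", "4.19.0"):
        print(release, "->", nvme_timeout_arg(release))
```

The resulting string would be appended to the kernel command line (for example through the GRUB config); where exactly that happens in Container Linux is outside the scope of this sketch.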

wrotte commented Sep 21, 2018

Hi folks, do we have any updates on this issue? Thanks.

mariusgrigoriu commented Sep 21, 2018

@Alalk submitted some PRs for the issue, but it looks like progress halted.

wrotte commented Sep 21, 2018

Looks like the Alpha AMI is able to boot on i3.metal instances. I think I can work with that until it gets promoted.

Member

bgilbert commented Sep 26, 2018

@wrotte Note that we're maintaining the beta and stable channels on the 4.14 kernel, so a promoted alpha won't actually help here.

wrotte commented Sep 27, 2018

@bgilbert Are there any plans to advance the kernel version on the stable or beta channels?

Member

bgilbert commented Sep 28, 2018

The plan is to keep beta and stable on 4.14 for now. After the next LTS kernel is out and has baked in alpha for a while, we'll decide whether to promote it to the other channels.

johanneswuerbach commented Sep 28, 2018

Which means CoreOS won't work (this issue and coreos/coreos-overlay#3367) on any of the new machine types (c5, m5, t3, i3, etc.) in the near future, and we should start looking for an alternative OS if we plan to use them?

Member

bgilbert commented Oct 1, 2018

@johanneswuerbach 4.19 should go final within the next couple of weeks. It should be in alpha soon after that, and we'll certainly keep this issue in mind as we decide whether to promote it past alpha.
