This repository has been archived by the owner. It is now read-only.

AWS: Low timeout for NVMe devices #2484

Closed
johanneswuerbach opened this issue Jul 30, 2018 · 9 comments

@johanneswuerbach johanneswuerbach commented Jul 30, 2018

Issue Report

Bug

Container Linux Version

1800.5.0

Environment

What hardware/cloud provider/hypervisor is being used to run Container Linux?

AWS us-east-1 c5.2xlarge

Expected Behavior

Having an NVMe I/O Operation Timeout configured according to the recommendations from AWS. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html#timeout-nvme-ebs-volumes

Actual Behavior

The NVMe timeout defaults to the kernel default of 30 seconds.

$ cat /sys/module/nvme_core/parameters/io_timeout
30
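A minimal sketch for checking this value on a given machine. The function name `check_nvme_timeout` and the optional second argument (an alternative file path, useful for testing) are illustrative assumptions; only the sysfs path comes from the report above.

```shell
# Compare the current NVMe I/O timeout against a desired value.
# Usage: check_nvme_timeout <desired_seconds> [param_file]
# Returns 0 on match, 1 on mismatch, 2 if the parameter file is unreadable.
check_nvme_timeout() {
    desired="$1"
    # Default path is the one shown in the report; override it for testing.
    param_file="${2:-/sys/module/nvme_core/parameters/io_timeout}"

    if [ ! -r "$param_file" ]; then
        echo "cannot read $param_file" >&2
        return 2
    fi

    current="$(cat "$param_file")"
    if [ "$current" = "$desired" ]; then
        echo "ok: io_timeout=$current"
        return 0
    fi
    echo "mismatch: io_timeout=$current, wanted $desired"
    return 1
}
```

On the instance described above, `check_nvme_timeout 4294967295` would be expected to report a mismatch against the current value of 30.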

Other Information

Might be related to #2371

@mariusgrigoriu mariusgrigoriu commented Aug 3, 2018

Where in the CoreOS code-base would this value be updated?

@Alalk Alalk commented Aug 8, 2018

@johanneswuerbach I'm testing whether the settings from the AWS docs fix this.

I was able to get the nvme_core params onto the kernel command line at boot. I had to edit the GRUB config in
/usr/share/oem/grub.cfg on a running EC2 instance, then create an AMI from it and launch a new instance.

I'm still not seeing the drives get picked up, although I'm testing on an i3 bare-metal instance.

[    0.000000] Command line: BOOT_IMAGE=/coreos/vmlinuz-a mount.usr=/dev/mapper/usr verity.usr=PARTUUID=7130c94a-213a-4e5a-8e26-6cce9662f132 rootflags=rw mount.usrflags=ro consoleblank=0 root=LABEL=ROOT console=ttyS0,115200n8 nvme_core.io_timeout=4294967295 nvme_core.max_retries=10 coreos.oem.id=ec2 modprobe.blacklist=xen_fbfront net.ifnames=0 verity.usrhash=398d83dd5252c42312d7ff4b49d0b854072cfcc03657d72e7c792ae24d60077e
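To confirm which `nvme_core` parameters actually reached the kernel, the command line can be read back from `/proc/cmdline`. The following is a sketch only: the helper name `cmdline_param` and its optional file argument are hypothetical and added for illustration.

```shell
# Print the value of a key=value kernel parameter; return 1 if it is absent.
# Usage: cmdline_param <name> [cmdline_file]
cmdline_param() {
    name="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # Scan every whitespace-separated field and compare the text before
    # the first "=" with the requested name (exact match, no regex).
    awk -v name="$name" '
        {
            for (i = 1; i <= NF; i++) {
                n = index($i, "=")
                if (n > 0 && substr($i, 1, n - 1) == name) {
                    print substr($i, n + 1)
                    found = 1
                    exit
                }
            }
        }
        END { exit (found ? 0 : 1) }
    ' "$cmdline_file"
}
```

For example, `cmdline_param nvme_core.io_timeout` prints the configured value when the parameter is present and returns a non-zero status when it is not.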


@Alalk Alalk commented Aug 9, 2018

@johanneswuerbach
I got this working. The GRUB config needs to be changed to set these nvme_core defaults. Note that it only works on kernel versions above 4.15. I will see if I can get a pull request into CoreOS for the fix.
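A rough sketch of how the value could be chosen per kernel version and turned into a GRUB config line. Several things here are assumptions rather than facts from this thread: the function names are hypothetical, kernel 4.15 itself is treated as supporting the larger value, and the `set linux_append=...` form is assumed to be what the OEM `grub.cfg` accepts. The limits (255 and 4294967295) and the retry count of 10 are the values quoted in this issue.

```shell
# Pick the largest io_timeout the running (or given) kernel is assumed to accept.
# Usage: nvme_max_timeout [kernel_release]
nvme_max_timeout() {
    release="${1:-$(uname -r)}"
    major="${release%%.*}"        # "4.14.55-coreos" -> "4"
    rest="${release#*.}"          # "4.14.55-coreos" -> "14.55-coreos"
    minor="${rest%%[!0-9]*}"      # "14.55-coreos"   -> "14"
    if [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 15 ]; }; then
        echo 4294967295
    else
        echo 255
    fi
}

# Emit a candidate line for the OEM GRUB config (assumed syntax).
# Usage: nvme_grub_line [kernel_release]
nvme_grub_line() {
    timeout="$(nvme_max_timeout "$1")"
    echo "set linux_append=\"\$linux_append nvme_core.io_timeout=${timeout} nvme_core.max_retries=10\""
}
```

Calling `nvme_grub_line` with no argument uses `uname -r`; the printed line would still need to be reviewed before being added to `/usr/share/oem/grub.cfg`.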

Alalk added a commit to Alalk/coreos-overlay that referenced this issue Aug 10, 2018
updating the grub config to use the nvme defaults required by aws. Should solve the failure to pass status checks. (eventually)
This only works on kernel version above 4.15. (core timeout max is 255 for below 4.15)

coreos/bugs#2464
coreos/bugs#2484
coreos/bugs#2371
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html#timeout-nvme-ebs-volumes
05c1f12

@r7vme r7vme commented Sep 24, 2018

We are also hitting this issue. Setting the timeout (255 seconds) and retries (10) seems to help.

Would love to have this "out-of-the-box" in AWS images.

@bgilbert bgilbert commented Nov 15, 2018

Closing as duplicate of #2464.

@bgilbert bgilbert commented Mar 14, 2019

Reopening per #2464 (comment).

@bgilbert bgilbert reopened this Mar 14, 2019

@dm0- dm0- commented May 7, 2019

The NVMe timeout has been changed for EC2 in the current alpha. It will promote to stable in mid-June.

@bgilbert bgilbert closed this May 16, 2019

@pms1969 pms1969 commented Jul 8, 2019

I'm running a mix of 2079.6 and 2135.4 and neither has the correct setting... What version is meant to have this fix?

@bgilbert bgilbert commented Jul 8, 2019

It's in 2135.0.0 and above, but only for new installs. Machines that are upgraded from older releases retain their previous settings.
