AWS: Low timeout for NVMe devices #2484

Open
johanneswuerbach opened this Issue Jul 30, 2018 · 4 comments

johanneswuerbach commented Jul 30, 2018

Issue Report

Bug

Container Linux Version

1800.5.0

Environment

What hardware/cloud provider/hypervisor is being used to run Container Linux?

AWS us-east-1 c5.2xlarge

Expected Behavior

The NVMe I/O operation timeout should be configured according to the AWS recommendations: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html#timeout-nvme-ebs-volumes

Actual Behavior

The NVMe timeout defaults to the kernel default of 30 seconds.

$ cat /sys/module/nvme_core/parameters/io_timeout
30
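
A slightly fuller check, as a minimal sketch: the helper name `show_nvme_params` is hypothetical, and it assumes `max_retries` is exposed in the same sysfs directory as `io_timeout` (it shows up as `nvme_core.max_retries` on a kernel command line further down this thread).

```shell
#!/bin/sh
# Print the current nvme_core parameters, or "unavailable" if a file is missing.
# Usage: show_nvme_params [PARAM_DIR]
# PARAM_DIR defaults to the sysfs path used in the report above.
show_nvme_params() {
    dir="${1:-/sys/module/nvme_core/parameters}"
    for p in io_timeout max_retries; do
        if [ -r "$dir/$p" ]; then
            printf '%s=%s\n' "$p" "$(cat "$dir/$p")"
        else
            printf '%s=unavailable\n' "$p"
        fi
    done
}

show_nvme_params
```

On the instance from this report, the first line would read `io_timeout=30`.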

Other Information

Might be related to #2371

@bgilbert bgilbert changed the title from AWS: Low timeout for to AWS: Low timeout for NVMe devices Jul 31, 2018

mariusgrigoriu commented Aug 3, 2018

Where in the CoreOS code-base would this value be updated?

Alalk commented Aug 8, 2018

@johanneswuerbach I'm testing whether the settings from the AWS docs fix this.

I was able to get the nvme_core parameters onto the kernel command line at boot. To do that, I had to edit
/usr/share/oem/grub.cfg on a running EC2 instance, create an AMI from it, and launch a new instance from that AMI.

I'm still not seeing the drives get picked up, although I'm testing on i3 bare metal.

[    0.000000] Command line: BOOT_IMAGE=/coreos/vmlinuz-a mount.usr=/dev/mapper/usr verity.usr=PARTUUID=7130c94a-213a-4e5a-8e26-6cce9662f132 rootflags=rw mount.usrflags=ro consoleblank=0 root=LABEL=ROOT console=ttyS0,115200n8 nvme_core.io_timeout=4294967295 nvme_core.max_retries=10 coreos.oem.id=ec2 modprobe.blacklist=xen_fbfront net.ifnames=0 verity.usrhash=398d83dd5252c42312d7ff4b49d0b854072cfcc03657d72e7c792ae24d60077e
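
To confirm from a running instance that the parameters actually reached the kernel command line, something like the following could be used. This is a sketch only: `cmdline_param` is a hypothetical helper, and it assumes parameters are whitespace-separated `NAME=VALUE` words as in the log line above.

```shell
#!/bin/sh
# Print the value of a NAME=VALUE kernel parameter; return 1 if it is absent.
# Usage: cmdline_param NAME [CMDLINE]
# CMDLINE defaults to the contents of /proc/cmdline.
cmdline_param() {
    name="$1"
    cmdline="${2:-$(cat /proc/cmdline 2>/dev/null)}"
    # Unquoted on purpose: split the command line into words.
    for word in $cmdline; do
        case "$word" in
            "$name"=*) printf '%s\n' "${word#*=}"; return 0 ;;
        esac
    done
    return 1
}

cmdline_param nvme_core.io_timeout ||
    echo "nvme_core.io_timeout is not set on the kernel command line"
```

Comparing this value with `/sys/module/nvme_core/parameters/io_timeout` shows whether the kernel accepted the requested setting.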


Alalk commented Aug 9, 2018

@johanneswuerbach
I got this working. The GRUB config needs to be changed to set these nvme_core defaults. Note that it only works on kernel versions above 4.15. I will see if I can get a pull request into CoreOS for the fix.
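
The kernel-version dependency could be handled by picking the value at configuration time. The sketch below is not the change from the pull request; `max_io_timeout` is a hypothetical helper, and it assumes that 4.15 itself already accepts the large value (the thread only says "above 4.15") and that older kernels are capped at 255.

```shell
#!/bin/sh
# Print the largest nvme_core.io_timeout assumed to be accepted by a kernel.
# Usage: max_io_timeout [RELEASE]
# RELEASE defaults to the running kernel (`uname -r`), e.g. "4.14.63-coreos".
max_io_timeout() {
    release="${1:-$(uname -r)}"
    major="${release%%.*}"        # text before the first dot
    rest="${release#*.}"          # text after the first dot
    minor="${rest%%[!0-9]*}"      # leading digits of the remainder
    if [ "$major" -gt 4 ] ||
        { [ "$major" -eq 4 ] && [ "${minor:-0}" -ge 15 ]; }; then
        echo 4294967295
    else
        echo 255
    fi
}

# Kernel arguments to append, using the retry count from this thread.
echo "nvme_core.io_timeout=$(max_io_timeout) nvme_core.max_retries=10"
```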

@Alalk Alalk referenced this issue Aug 10, 2018

Closed

Update grub.cfg #834

Alalk added a commit to Alalk/coreos-overlay that referenced this issue Aug 10, 2018

nvme timeout issue.
Update the GRUB config to use the NVMe defaults required by AWS. This should (eventually) resolve the failure to pass status checks.
This only works on kernel versions above 4.15 (below 4.15, the nvme_core timeout maximum is 255).

coreos/bugs#2464
coreos/bugs#2484
coreos/bugs#2371
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html#timeout-nvme-ebs-volumes

Alalk added a commit to Alalk/coreos-overlay that referenced this issue Aug 10, 2018

nvme timeout issue patch for kernels below 4.15
Update the GRUB config to use the NVMe defaults required by AWS. This should (eventually) resolve the failure to pass status checks.
This only works on kernel versions above 4.15 (below 4.15, the nvme_core timeout maximum is 255).

coreos/bugs#2464
coreos/bugs#2484
coreos/bugs#2371
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html#timeout-nvme-ebs-volumes
05c1f12

Alalk added a commit to Alalk/coreos-overlay that referenced this issue Aug 10, 2018

nvme timeout issue
Update the GRUB config to use the NVMe defaults required by AWS. This should (eventually) resolve the failure to pass status checks.
This only works on kernel versions above 4.15 (below 4.15, the nvme_core timeout maximum is 255).

coreos/bugs#2464
coreos/bugs#2484
coreos/bugs#2371
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html#timeout-nvme-ebs-volumes
05c1f12

Alalk added a commit to Alalk/coreos-overlay that referenced this issue Aug 10, 2018

nvme timeout issue patch
Update the GRUB config to use the NVMe defaults required by AWS. This should (eventually) resolve the failure to pass status checks.
This only works on kernel versions above 4.15 (below 4.15, the nvme_core timeout maximum is 255).

coreos/bugs#2464
coreos/bugs#2484
coreos/bugs#2371
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html#timeout-nvme-ebs-volumes

Alalk added a commit to Alalk/coreos-overlay that referenced this issue Aug 10, 2018

Update grub-ec2.cfg
Update the GRUB config to use the NVMe defaults required by AWS. This should (eventually) resolve the failure to pass status checks.
This only works on kernel versions above 4.15 (below 4.15, the nvme_core timeout maximum is 255).

coreos/bugs#2464
coreos/bugs#2484
coreos/bugs#2371
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html#timeout-nvme-ebs-volumes

r7vme commented Sep 24, 2018

We are also hitting this issue; setting the timeout (255 sec) and retries (10) seems to help.

Would love to have this "out-of-the-box" in AWS images.
