
The ec2 instance with localnvme storage fails to boot with fcos stable version: 36.20220906.3.2 #1306

Closed
gongx opened this issue Sep 26, 2022 · 18 comments · Fixed by coreos/fedora-coreos-config#2291


gongx commented Sep 26, 2022

Describe the bug
An EC2 instance with local NVMe storage (i3, c5d, etc.) fails to boot using FCOS stable version 36.20220906.3.2 with ami-0dbce9bea71a2ee29.

Expected behavior
An EC2 instance with local NVMe storage should start successfully using FCOS stable version 36.20220906.3.2 with ami-0dbce9bea71a2ee29.

We have tried the previous stable version, Fedora CoreOS 36.20220806.3.0, which works fine.

Actual behavior
Noticed the following errors in the system logs:

systemd[1]: dev-nvme0n1.device: Job dev-nvme0n1.device/start timed out.
[ TIME ] Timed out waiting for device dev-nvme0n1.device - /dev/nvme0n1.
[   97.678110] systemd[1]: Timed out waiting for device dev-nvme0n1.device - /dev/nvme0n1.
systemd[1]: dev-nvme0n1.device: Job dev-nvme0n1.device/start failed with result 'timeout'.
[   97.690574] ignition[703]: disks: createPartitions: op(1): [failed] waiting for devices [/dev/nvme0n1]: device unit dev-nvme0n1.device timeout
[   97.696737] systemd[1]: ignition-disks.service: Main process exited, code=exited, status=1/FAILURE
[   97.701377] ignition[703]: disks failed

System details

  • AWS
  • Fedora CoreOS version 36.20220906.3.2
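For context, the failing `createPartitions` step in the logs corresponds to an Ignition storage config that targets the local NVMe device. The reporter's exact config is not shown in this thread; the following is a hypothetical minimal Butane snippet of the kind of storage section that makes Ignition wait on `/dev/nvme0n1` at first boot (device path and labels are assumptions for illustration):

```yaml
# Hypothetical Butane config: partition the local (instance store) NVMe disk.
# Ignition's createPartitions stage waits for this device at first boot,
# which is the wait that times out in the logs above.
variant: fcos
version: 1.4.0
storage:
  disks:
    - device: /dev/nvme0n1
      wipe_table: true
      partitions:
        - label: scratch
  filesystems:
    - device: /dev/disk/by-partlabel/scratch
      format: xfs
```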
@gongx gongx added the kind/bug label Sep 26, 2022
@gongx gongx changed the title node boot fails with fcos stable version: 36.20220906.3.2 and ami-0dbce9bea71a2ee29 The ec2 instance with localnvme storage fails to boot with fcos stable version: 36.20220906.3.2 Sep 26, 2022
@dustymabe
Member

Can you give a few exact names of instance types you've tried this on? We should probably enhance our AWS tests to cover a few more instance types.


gongx commented Sep 27, 2022

i3.large and c5d.4xlarge

@dustymabe
Member

This is reported upstream. It appears the summary of the investigation is that the Linux 5.19.x series exposed an issue with the NVMe controller firmware and they (AWS) are going to rollout a new firmware across the fleet in order to fix the issue (i.e. no changes/reverts in the kernel itself).

If you want a status update on the firmware rollout please ask on the upstream thread or contact AWS customer service.


gongx commented Oct 3, 2022

Thank you for the update

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue Oct 4, 2022
This test ensures that if an nvme device exists it is accessible.
See coreos/fedora-coreos-tracker#1306

This commit also denylists the test with a snooze for the next few
weeks. The hope is that Amazon does the firmware rollout soon.
jlebon pushed a commit to coreos/fedora-coreos-config that referenced this issue Oct 5, 2022
This test ensures that if an nvme device exists it is accessible.
See coreos/fedora-coreos-tracker#1306

This commit also denylists the test with a snooze for the next few
weeks. The hope is that Amazon does the firmware rollout soon.
@dustymabe
Member

With coreos/fedora-coreos-config#2005 and coreos/fedora-coreos-pipeline#669 we added a test and we'll know if local NVMe storage regresses again in the future.

The test will be enabled properly once AWS rolls out the controller firmware update.
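The shape of such a check can be sketched in shell (illustrative only; the function name and structure here are assumptions, not the actual test source from fedora-coreos-config):

```shell
#!/bin/sh
# Illustrative sketch of what an NVMe-presence test checks.
# A bare controller appears as /dev/nvme0 (a character device); a usable
# disk also exposes a namespace block device such as /dev/nvme0n1.
check_namespace() {
  if [ -b "$1" ]; then
    echo "PASS: $1 present"
  else
    echo "FAIL: $1 not a usable block device"
    return 1
  fi
}

# /dev/null stands in here for the missing namespace (it is a character
# device, not a block device, so the check fails, just as it does on an
# affected instance where only the bare controller shows up):
check_namespace /dev/null || true
```

On a healthy instance, running the same check against `/dev/nvme0n1` would print the PASS line instead.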

@dustymabe
Member

OK the test seemed to pass in recent tests:

[2022-10-24T14:35:29.338Z] --- PASS: non-exclusive-test-bucket-0 (89.24s)
[2022-10-24T14:35:29.338Z]     --- PASS: non-exclusive-test-bucket-0/ext.config.platforms.aws.assert-xen (2.08s)
[2022-10-24T14:35:29.338Z]     --- PASS: non-exclusive-test-bucket-0/ext.config.platforms.aws.nvme (2.06s)

I assume AWS performed the necessary firmware update.

@gongx - can you confirm things are looking good for you now?


gongx commented Oct 24, 2022

Cool, thank you for the update. I will test today and report back whether the issue is fixed.


gongx commented Oct 24, 2022

I double-checked. It looks like it still fails with:

[   97.173192] ignition[701]: disks: createPartitions: op(1): [failed]   waiting for devices [/dev/nvme0n1]: device unit dev-nvme0n1.device timeout
[   97.179712] systemd[1]: dev-nvme0n1.device: Job dev-nvme0n1.device/start failed with result 'timeout'.
[   97.333995] ignition[701]: Ignition failed: create partitions failed: failed to wait on disks devs: device unit dev-nvme0n1.device timeout
[   97.341874] systemd[1]: ignition-disks.service: Main process exited, code=exited, status=1/FAILURE

@dustymabe
Member

Just to make sure we're comparing apples to apples - can you try with ami-0a26304758de712ba from us-east-1 on an i3.large instance and see if that one works for you?


gongx commented Oct 24, 2022

I am using

Region: us-east-1
Release: 36.20221001.3.0
Image: [ami-0756632d3ab28028a](https://console.aws.amazon.com/ec2/home?region=us-east-1#launchAmi=ami-0756632d3ab28028a)

Which version are you using for the test? It would be easier for us to just use the stable version.

@dustymabe
Member

Hmm. Yes. Testing with ami-0756632d3ab28028a from the current stable release (36.20221001.3.0) fails the test.

Testing with ami-06ce37b8291c5f294 from the current testing release (36.20221014.2.0) fails the test.

Testing with ami-0df8b995ad14e1d2d from the current next release (37.20221015.1.0) passes the test.

So it may be some combination of newer software? The testing and next releases mentioned above ship the same kernel version, so maybe it's something else?

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue Oct 25, 2022
It looks like our F37+ streams pass this test now [1] so let's also
only deny the test on streams where it's known to fail.

[1] coreos/fedora-coreos-tracker#1306 (comment)
dustymabe added a commit to coreos/fedora-coreos-config that referenced this issue Oct 25, 2022
It looks like our F37+ streams pass this test now [1] so let's also
only deny the test on streams where it's known to fail.

[1] coreos/fedora-coreos-tracker#1306 (comment)
@dustymabe
Member

OK, I almost wonder if AWS backed out their firmware update. The test started failing in our rawhide stream today, and poking around, it's not passing on AMIs it was previously passing on. For example, ami-0df8b995ad14e1d2d from #1306 (comment) no longer passes the test:

Fedora CoreOS 37.20221015.1.0
Tracker: https://github.com/coreos/fedora-coreos-tracker
Discuss: https://discussion.fedoraproject.org/tag/coreos

[core@ip-172-31-69-207 ~]$ ls /dev/nvm*
/dev/nvme0

Where we should see nvme0n1 in there.
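As an aside on why the listing above signals a failure: `/dev/nvme0` is the NVMe controller node, while `/dev/nvme0n1` is the namespace block device that Ignition actually partitions. A small name-based sketch of the distinction (illustrative, not from the thread):

```shell
#!/bin/sh
# Distinguish an NVMe controller node (nvme0) from a namespace node
# (nvme0n1): namespace names append "n<number>" to the controller name.
is_namespace() {
  case "${1##*/}" in
    nvme[0-9]*n[0-9]*) echo yes ;;
    *) echo no ;;
  esac
}

is_namespace /dev/nvme0n1   # yes: namespace block device
is_namespace /dev/nvme0     # no: controller only
```

When only the controller node appears, the disk is effectively unusable, which is why Ignition's device wait times out.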

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue Nov 2, 2022
It seems as if the fix that AWS had applied is no longer working.
See coreos/fedora-coreos-tracker#1306 (comment)
dustymabe added a commit to coreos/fedora-coreos-config that referenced this issue Nov 2, 2022
It seems as if the fix that AWS had applied is no longer working.
See coreos/fedora-coreos-tracker#1306 (comment)
@dustymabe
Member

This is still a problem as of today. We need to extend the snooze for this test again.


davdunc commented Dec 5, 2022

Looking into this at AWS internally. I have an internal tracking ticket.

@cgwalters cgwalters pinned this issue Dec 6, 2022
@cgwalters cgwalters added cloud* related to public/private clouds component/kernel labels Dec 6, 2022
@c4rt0 c4rt0 self-assigned this Dec 12, 2022
@c4rt0 c4rt0 added the jira for syncing to jira label Dec 12, 2022
c4rt0 added a commit to c4rt0/fedora-coreos-config that referenced this issue Dec 14, 2022
It seems as if the fix that AWS had applied is no longer working.
See: coreos/fedora-coreos-tracker#1306 (comment)
dustymabe pushed a commit to coreos/fedora-coreos-config that referenced this issue Dec 14, 2022
It seems as if the fix that AWS had applied is no longer working.
See: coreos/fedora-coreos-tracker#1306 (comment)
@dustymabe
Member

This is still failing with whatever environment AWS has as of today (2023-01-11) and kernel-6.0.18-300.fc37. Are we going to get NVMe disks back on AWS anytime soon?

@dustymabe
Member

This is still failing with whatever environment AWS has as of today (2023-02-10) and kernel-6.1.9-200.fc37.


davdunc commented Feb 10, 2023

Okay. I am taking this back to the AWS EBS team for review.

@dustymabe
Member

It appears this is passing our test now. In the most recent next-devel run for 38.20230310.10.0:

[2023-03-10T21:27:39.175Z] --- PASS: non-exclusive-test-bucket-0 (205.82s)
[2023-03-10T21:27:39.175Z]     --- PASS: non-exclusive-test-bucket-0/ext.config.platforms.aws.assert-xen (2.10s)
[2023-03-10T21:27:39.175Z]     --- PASS: non-exclusive-test-bucket-0/ext.config.platforms.aws.nvme (2.12s)

Hopefully it's really resolved this time!

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue Mar 10, 2023
Recent tests are passing. Hopefully the issue in the AWS environment
is fully resolved now.

Closes coreos/fedora-coreos-tracker#1306 (comment)
dustymabe added a commit to coreos/fedora-coreos-config that referenced this issue Mar 11, 2023
Recent tests are passing. Hopefully the issue in the AWS environment
is fully resolved now.

Closes coreos/fedora-coreos-tracker#1306 (comment)
@c4rt0 c4rt0 removed their assignment Mar 11, 2023
@dustymabe dustymabe unpinned this issue Mar 15, 2023
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
This test ensures that if an nvme device exists it is accessible.
See coreos/fedora-coreos-tracker#1306

This commit also denylists the test with a snooze for the next few
weeks. The hope is that Amazon does the firmware rollout soon.
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
It looks like our F37+ streams pass this test now [1] so let's also
only deny the test on streams where it's known to fail.

[1] coreos/fedora-coreos-tracker#1306 (comment)
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
It seems as if the fix that AWS had applied is no longer working.
See coreos/fedora-coreos-tracker#1306 (comment)
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
It seems as if the fix that AWS had applied is no longer working.
See: coreos/fedora-coreos-tracker#1306 (comment)
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
Recent tests are passing. Hopefully the issue in the AWS environment
is fully resolved now.

Closes coreos/fedora-coreos-tracker#1306 (comment)