Some AWS instance types fail to get networking starting with 35.20211226.20.0 #1066
Comments
It's also notable that the This "Jenkins timeout", as opposed to a hard failure, led us to think that Jenkins was just having trouble, so we ignored the behavior. Combine that with the holidays, when we weren't paying as much attention, and we get this perfect storm. |
I performed a bisect to find when the m4.large instances started failing. It seems that The change in
|
I took the latest |
I ran a few more tests:
|
One thing to note here is that @mike-nguyen is reporting that
So there's at least something to investigate there. |
This was still hung after 20 minutes (m4.large):
I ran an m5.large instance with the same stream and it booted up right away. Seems like
|
Weird. Here's my terminal output from the run:
|
And running it again failed. So yeah. Looks like we just got lucky when we ran |
Some more information. The difference between Fedora Cloud Base and Fedora 35 Cloud Base - ami-08b4ee602f76bff79 - (from release day)
After full update (
On latest FCOS stable
In Fedora Cloud Base it's using the Indeed in FCOS we have enhanced networking enabled so that's why we're using the
So it must be some issue with the driver in the newer kernel.
you can't disable it. |
Until we get a fix we need to investigate if we can properly denylist the If we can't we'll probably need to revert to the older kernel in |
I got similar results overnight. I ran the |
For completeness, I enabled enhanced networking on Fedora Cloud Base 35 from release day (ami-08b4ee602f76bff79) and I was not able to SSH to the VM afterwards. Here are the steps:
|
I finally managed to get into a
In case it ends up mattering this particular instance is in |
After a bunch more tries I got up another instance in |
AWS Internal Review Reference: tt:V502410588 |
For now we'll revert the kernel. Options:
Looking at CVE info (details below) there were no kernel CVEs fixed between
|
Newer kernels seem to have an issue with enhanced networking on some AWS instance types so they don't boot. Let's pin on the older kernel for now while we investigate and find a proper solution. See coreos/fedora-coreos-tracker#1066
pin PR: coreos/fedora-coreos-config#1416 |
I just built a |
I spoke too soon. It seems like the failure rate is more like 50-60% though, rather than 95%, which is why I thought it was fixed. |
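The "got lucky" runs earlier in this thread are what an intermittent failure tends to produce. As a rough sketch, assuming each launch fails independently with a fixed probability (an assumption; nothing in this issue establishes that), the odds of a misleading all-green run can be estimated like this. The function names are hypothetical and exist only for this illustration.

```python
import math


def prob_all_boots_succeed(failure_rate: float, boots: int) -> float:
    """Chance that `boots` independent launches all succeed."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if boots < 0:
        raise ValueError("boots must be non-negative")
    return (1.0 - failure_rate) ** boots


def boots_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of launches that shows at least one failure
    with probability >= `confidence`, under the independence assumption."""
    if not 0.0 < failure_rate <= 1.0:
        raise ValueError("failure_rate must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    if failure_rate == 1.0:
        return 1  # every launch fails, so one launch is enough
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    # Failure rates quoted in the comment above: 50-60% versus 95%.
    for rate in (0.5, 0.6, 0.95):
        print(rate, prob_all_boots_succeed(rate, 3))
```

At a 50% failure rate, three clean boots in a row still happen about one time in eight, so a handful of passing runs is weak evidence that the problem is fixed.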
Newer kernels seem to have an issue with enhanced networking on some AWS instance types so they don't boot. Let's pin on the older kernel for now while we investigate and find a proper solution. See coreos/fedora-coreos-tracker#1066 (cherry picked from commit 467af82)
This is the first kernel with the most recent revert that allows us to not regress on some AWS instance types [1]. Because it is newer than 5.16.11 it also allows for us to pick up the fix to CVE-2022-0847 [2]. [1] coreos/fedora-coreos-tracker#1066 [2] coreos/fedora-coreos-tracker#1118
related issue: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1006346 |
Upstream patch: https://lore.kernel.org/linux-pci/87tuaduxj5.ffs@tglx |
We did the revert to get our streams back in shape, and none of our tests have failed since then, so I assume the necessary fixes landed in the kernel. Closing this out. |
We shipped out updates this morning to `testing` and `next` with a downgraded kernel that matches what is already in `stable`. For our upcoming regularly scheduled releases let's at least get to the latest possible known good kernel. See coreos/fedora-coreos-tracker#1066
This kernel has a revert [1] that allows us to get AWS instance types working again [2] and also is newer so it includes a fix for recent CVE-2022-0185 [3]. [1] https://gitlab.com/cki-project/kernel-ark/-/commit/63aede4 [2] coreos/fedora-coreos-tracker#1066 (comment) [3] https://bugzilla.redhat.com/show_bug.cgi?id=2042052
This allows us to get the latest kernel-5.16.12-200.fc35. Moving to a kernel newer than 5.16.11 picks up the fix for CVE-2022-0847. We're able to do this because the Fedora kernel maintainers agreed to again pick up a revert that allows us to not regress on some AWS instance types (coreos/fedora-coreos-tracker#1066). Closes coreos/fedora-coreos-tracker#1118
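The "newer than 5.16.11" condition in that commit message can be checked mechanically. The following is a simplified sketch, not real RPM version comparison: it compares only the three-part upstream version and ignores the release field (`-200.fc35`). The helper names are hypothetical. It also says nothing about whether a given build carries the AWS revert, which has to be confirmed separately.

```python
import re
from typing import Tuple


def kernel_version(nvr: str) -> Tuple[int, ...]:
    """Extract the first X.Y.Z version from a string such as
    'kernel-5.16.12-200.fc35' and return it as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", nvr)
    if match is None:
        raise ValueError(f"no X.Y.Z version found in {nvr!r}")
    return tuple(int(part) for part in match.groups())


def is_newer_than(nvr: str, baseline: str = "5.16.11") -> bool:
    """True if the upstream version in `nvr` is strictly newer than `baseline`."""
    return kernel_version(nvr) > kernel_version(baseline)


if __name__ == "__main__":
    print(is_newer_than("kernel-5.16.12-200.fc35"))
```

Comparing tuples of integers avoids the string-comparison trap where "5.16.9" would sort after "5.16.11".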
Describe the bug
It seems as if networking never comes up enough to reach the metadata service:
Full log: i-0ab8a90bd2a4814db.log
You can reproduce easily with something like:
The instance never boots. Checking out the system log shows the infinite Ignition attempts.