"GRUB loading Read Error" post-install on bare-metal (Dell 1950) #1147

Closed
treed opened this issue Mar 2, 2016 · 19 comments

@treed

commented Mar 2, 2016

This is an apparent regression between the current beta (899.9.0) and the current alpha (970.1.0).

When installing alpha onto our Dell 1950s (by booting from a USB stick and running coreos-install), the installed system fails to boot with "GRUB Loading Read Error".

This happened consistently across 4 1950s, but does not happen on an R420.

When installing beta instead, the error does not occur and the 1950s seem to work fine with beta.

Another possibly relevant difference between the 1950s and the R420 is that the former have simple SAS cards, whereas the latter has a MegaSAS RAID controller.

I can provide whatever output is needed to compare between the two machines.

If it's possible to get a list of versions between those two, I can also attempt to bisect and more tightly nail down when the regression occurred.

@crawford

Member

commented Mar 3, 2016

/cc @mjg59

@mjg59

commented Mar 3, 2016

Are these being booted via UEFI or via BIOS?

@treed

Author

commented Mar 3, 2016

As far as I can tell these don't support EFI of any kind, so I believe they're being booted via BIOS. I was unable to find any options along these lines in the BIOS setup.

(They're kinda old.)
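
For anyone trying to answer the UEFI-or-BIOS question on similar hardware, here is a minimal sketch of one common check. It assumes a Linux environment (for example, the USB install environment) and relies on the kernel exposing `/sys/firmware/efi` only when it was booted via UEFI. The function name `boot_mode` and its `efi_dir` parameter are illustrative and do not come from this thread.

```python
import os


def boot_mode(efi_dir="/sys/firmware/efi"):
    """Return "UEFI" if the EFI firmware directory exists, else "BIOS".

    efi_dir is a parameter only so the check can be exercised against
    other paths; on a real system the default is the one to use.
    """
    return "UEFI" if os.path.isdir(efi_dir) else "BIOS"


if __name__ == "__main__":
    # Reports how the currently running system was booted.
    print(boot_mode())
```

This only describes the running system, so it has to be run from an environment that boots the same way the installed system would.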

@treed

Author

commented Mar 3, 2016

To quantify "kinda old": we bought one of them on April 13th, 2007.

@mjg59

commented Mar 3, 2016

Do you have the exact model?

@treed

Author

commented Mar 3, 2016

The four that exhibit this problem are Dell 1950s. We've since upgraded them with the best possible CPU that the motherboard will accept and more RAM, but AFAIK the hardware is otherwise stock.

@treed

Author

commented Mar 3, 2016

This is the model of the SAS controller:

02:08.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068 PCI-X Fusion-MPT SAS (rev 01)

@treed

Author

commented Mar 3, 2016

Here's one of them in Dell's support system: http://www.dell.com/support/home/us/en/04/product-support/servicetag/2CN3RC1/configuration

Based on that information, we must have also upgraded the hard drives. The CPUs were upgraded to Xeon E5345s, and the RAM was upgraded to 32GiB.

@mjg59

commented Mar 3, 2016

Ok, this is probably an issue in the TPM code. Let me take a look. Do you know if these systems have TPMs?

@treed

Author

commented Mar 3, 2016

I'm unsure. How would I tell?

@mjg59

commented Mar 3, 2016

Does /dev/tpm0 exist?

@treed

Author

commented Mar 3, 2016

Nope. Checked the R420 that is successfully running alpha and it also doesn't exist there.
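
The `/dev/tpm0` check asked about above can be scripted as a minimal sketch like the one below. The helper name `has_tpm_device` is illustrative. Note the assumption built into it: a missing device node suggests there is no usable TPM, but it could also mean the TPM is disabled in firmware or the kernel driver is not loaded.

```python
import os


def has_tpm_device(path="/dev/tpm0"):
    """Return True if the TPM character device node is present.

    A False result is only a hint: the TPM may exist but be disabled
    in firmware, or its kernel driver may not be loaded.
    """
    return os.path.exists(path)


if __name__ == "__main__":
    print("TPM device present" if has_tpm_device() else "No /dev/tpm0")
```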

@mjg59

commented Mar 3, 2016

The relevant code shouldn't have changed here, but clearly it has. I'm digging into this in more detail.

@treed

Author

commented Mar 4, 2016

So I did a bit of bisecting:

970.1.0 exhibits the problem
933.0.0 exhibits the problem
926.0.0 exhibits the problem
921.0.0 does not
899.9.0 does not

It seems that the problem was introduced between 921 and 926. Worth noting that 926 included an upgraded GRUB (to address a CVE, but maybe other changes came in with it?).
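
The bisection above can be sketched as a small binary search. This is only an illustration under stated assumptions: the version list contains just the releases named in this thread (not every release that exists between them), the oldest is known good, the newest is known bad, and once the regression appears it stays. The `observed_bad` set stands in for the real test, which is installing a version on a 1950 and seeing whether it boots. The name `bisect_versions` is hypothetical.

```python
def bisect_versions(versions, is_bad):
    """Return (last_good, first_bad) from a list sorted oldest to newest.

    Assumes versions[0] is good, versions[-1] is bad, and that every
    version after the first bad one is also bad.
    """
    lo, hi = 0, len(versions) - 1  # versions[lo] is good, versions[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return versions[lo], versions[hi]


# Versions and results as reported in the comment above.
versions = ["899.9.0", "921.0.0", "926.0.0", "933.0.0", "970.1.0"]
observed_bad = {"926.0.0", "933.0.0", "970.1.0"}

last_good, first_bad = bisect_versions(versions, lambda v: v in observed_bad)
print(last_good, first_bad)  # prints "921.0.0 926.0.0"
```

With five candidates this needs only two install-and-boot tests (926.0.0, then 921.0.0), which matters when each test means reinstalling a physical machine.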

@mjg59

commented Mar 24, 2016

This should be fixed in 998, which will be released in the near future. Sorry for the lack of updates; this turned out to be subtle.

@treed

Author

commented Mar 24, 2016

Oh, sweet. I've been heads down on other stuff. Thanks for taking care of it! I'll test out 998 when it comes out and report back.

@crawford

Member

commented Mar 24, 2016

@mjg59 which change actually fixed this behavior?

@mjg59

commented Apr 22, 2016

mjg59 closed this Apr 22, 2016

@treed

Author

commented May 16, 2016

Sorry it took me a while to get back to this, but I have tested and CoreOS 1010.3.0 both upgrades and installs fresh on these 1950s without issue. Thanks for the fix!
