This repository has been archived by the owner on Oct 16, 2020. It is now read-only.

"invalid GPT signature" after automatic upgrade from 668.2.0 to 675.0.0 #356

Closed
treed opened this issue May 8, 2015 · 19 comments

Comments

@treed

treed commented May 8, 2015

I installed 668.2.0 to baremetal earlier this week; starting yesterday, it's been trying to automatically upgrade and hosing the install.

After the automatic reboot, GRUB gives three options:

CoreOS default says:

error: invalid GPT signature.
error: no such partition.

CoreOS USR-A and CoreOS USR-B just give the latter message. This is on a Dell PowerEdge R420. Let me know if you need to know anything else about it.

@treed
Author

treed commented May 8, 2015

I just tested and I can do a bare-metal install of 675.0.0, so it's specifically the upgrade.

@crawford
Contributor

Is this the first time you've used CoreOS on the R420? I'm willing to bet this is related to your other bug (#340).

@treed
Author

treed commented May 12, 2015

Following my workaround for #340, this was my first working install on this hardware, yes.

Given the workaround (disabling VT in the BIOS), I am able to use the system normally, use coreos-install to put CoreOS on the HD, and then boot into that HD-based install. It comes up just fine and is able to use the disks.

It's certainly possible that this is related to that bug, but I'd rate it as fairly unlikely for a system that has already booted off of the disk in question.

I guess I'd probably start debugging this by enumerating the differences in how coreos-install interacts with the disk compared with how the upgrade process interacts with the disk.

@treed
Author

treed commented Jul 16, 2015

This is still happening as of 735.0.0 -> 738.1.0.

Here's a copy of the update_engine log: http://sprunge.us/eWLf

There are three errors in the log, all in OmahaResponseHandlerAction. I don't know how important those errors might be, but they seem worth pointing out.

I'm kind of at a loss to explain how the upgrade installer can hose the GPT as it does while other disk access works just fine.

@marineam

@treed: Is the disk controller a RAID card? We've recently gotten two independent reports of the GPT getting corrupted when using add-on RAID cards. We haven't been able to dig into this in depth yet; we still need to acquire some problematic hardware ourselves to figure out exactly when the corruption occurs. One theory is that the BIOS calls the bootloader uses to write to the GPT do not work correctly on some devices. (A counter on the USR partition about to be booted is decremented to indicate that a boot was attempted.)

If that write in the bootloader is at fault, it may be worth trying to boot the system in UEFI mode if you are not already, just in case writing via the UEFI APIs does work properly. Beyond that, I am going to need to get personal with some real hardware to figure out what is going on.
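For anyone unfamiliar with that counter: cgpt prints it as the tries value in the partition attributes (for example "priority=2 tries=1 successful=0"). Below is a minimal Python sketch of what decrementing it means at the bit level. It assumes the ChromeOS-style attribute layout (priority in bits 48-51, tries in bits 52-55, successful in bit 56); the function names are made up for illustration, and this is not the bootloader's actual code.

```python
# Assumed ChromeOS-style layout of the 64-bit GPT partition attribute field.
PRIORITY_SHIFT = 48    # bits 48-51
TRIES_SHIFT = 52       # bits 52-55
SUCCESSFUL_SHIFT = 56  # bit 56
NIBBLE = 0xF


def encode_attrs(priority, tries, successful):
    """Pack the three boot fields into a 64-bit attribute value."""
    return ((priority & NIBBLE) << PRIORITY_SHIFT
            | (tries & NIBBLE) << TRIES_SHIFT
            | (successful & 1) << SUCCESSFUL_SHIFT)


def decode_attrs(attrs):
    """Unpack the boot fields from a 64-bit attribute value."""
    return {
        "priority": (attrs >> PRIORITY_SHIFT) & NIBBLE,
        "tries": (attrs >> TRIES_SHIFT) & NIBBLE,
        "successful": (attrs >> SUCCESSFUL_SHIFT) & 1,
    }


def decrement_tries(attrs):
    """Return attrs with the tries counter lowered by one (never below zero)."""
    tries = (attrs >> TRIES_SHIFT) & NIBBLE
    if tries == 0:
        return attrs
    cleared = attrs & ~(NIBBLE << TRIES_SHIFT)
    return cleared | (tries - 1) << TRIES_SHIFT
```

The relevant point for this bug is that changing even this one field means rewriting a GPT partition entry and updating the checksums that cover it, so a boot attempt on a freshly updated partition involves a write to the partition table from the bootloader.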

/cc @brianredbeard

@marineam

@treed also, in at least one other case only the primary GPT was bad, so recovering from the backup worked. Booting a CoreOS ISO or PXE image and running cgpt repair /dev/sda or similar should do that.
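To see which copy is damaged before repairing, something like the following Python sketch can help. It only checks the "EFI PART" signature and the header CRC32 at LBA 1 and at the last LBA, following the standard GPT header layout; it does not validate the partition entry arrays, and it assumes 512-byte logical sectors. The function names are illustrative and not part of cgpt.

```python
import struct
import zlib

GPT_SIGNATURE = b"EFI PART"
MIN_HEADER_SIZE = 92  # size of the defined GPT header fields


def header_is_valid(sector):
    """Return True if `sector` holds a GPT header with a good signature and CRC."""
    if len(sector) < MIN_HEADER_SIZE or sector[:8] != GPT_SIGNATURE:
        return False
    # Header size is at byte offset 12, header CRC32 at offset 16 (little-endian).
    header_size, stored_crc = struct.unpack_from("<II", sector, 12)
    if not MIN_HEADER_SIZE <= header_size <= len(sector):
        return False
    # The CRC is computed over the header with the CRC field itself zeroed.
    header = bytearray(sector[:header_size])
    header[16:20] = b"\x00\x00\x00\x00"
    return zlib.crc32(bytes(header)) & 0xFFFFFFFF == stored_crc


def check_gpt_headers(disk, sector_size=512):
    """Check the primary (LBA 1) and backup (last LBA) headers of an open disk or image."""
    disk.seek(0, 2)                 # seek to the end to learn the size
    total = disk.tell()
    disk.seek(sector_size)          # LBA 1
    primary = disk.read(sector_size)
    disk.seek(total - sector_size)  # last LBA
    backup = disk.read(sector_size)
    return {"primary": header_is_valid(primary), "backup": header_is_valid(backup)}
```

Usage would be along the lines of opening the device read-only as root, e.g. `with open("/dev/sda", "rb") as disk: print(check_gpt_headers(disk))`. If only the primary is reported bad, restoring from the backup is the case described above.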

@treed
Author

treed commented Jul 16, 2015

It is using a PERC card of some kind. I can look up the exact model tomorrow. I'll also check to see what the BIOS settings are and play around with that some.

If you, or someone else in the SF Bay Area, would like to, I might be able to arrange a coordinated debugging session with the hardware, or with similar hardware. Our offices are in Mountain View/SF, and the servers are in Oakland.

@MikeRoetgers

@treed We had the same problem, and it was very annoying to debug. After some experimenting we found a relatively simple workaround that worked for us. CoreOS was installed on multiple machines to /dev/sda, which was a RAID 5 array of either ~2.5 TB or ~5.5 TB. Whenever a machine restarted after an upgrade, we ran into "invalid GPT signature". What solved it for us was changing our RAID config to use a smaller set for CoreOS (2 disks in RAID 1, less than 1 TB in size). The problem disappeared on all machines.

@treed
Author

treed commented Jul 16, 2015

Okay, so a few results:

The card is a PERC H310 configured with a 5.5TB RAID 5.

Running cgpt repair claimed to fix things:

Primary Header is updated.
Primary Entries is updated.

But attempting to boot after that didn't even make it to GRUB. Now it just sits there after "PXE-M0F: Exiting Broadcom PXE ROM" after being instructed to boot from local disk.

It was previously configured for BIOS rather than UEFI. I've switched it to UEFI and am trying to test with that configuration.

@MikeRoetgers Thanks for that. If I can't get it working with UEFI I might give that a shot. Unfortunate, though.

@marineam

@treed ok, at this point BIOS mode may not be working if the MBR boot code also got clobbered. UEFI mode should work though.

@treed
Author

treed commented Jul 17, 2015

nod I'm currently struggling to even get this thing booting via UEFI. :(

I've got ipxe.lkrn being served up via pxe, with embedded instructions to boot CoreOS with a cloud-config that runs coreos-install and then reboots.

I found that I had to use a different syslinux for efi, but now it boots, pulls syslinux.efi and gets an IP and then just... does nothing. Trying to figure out if I need a different ipxe file or something.

@marineam

treed: since there currently isn't a backup copy of the MBR code in the image (and no grub-install to generate one either) you can try to boot again via MBR by writing this to the disk:

wget https://storage.googleapis.com/users.developer.core-os.net/marineam/mbr.bin
dd if=mbr.bin of=/dev/sda

But if it is legacy bios mode that triggers the bug you'll get stuck again once a new update comes. So trying to boot via UEFI mode is still worth a shot.

@treed
Author

treed commented Jul 21, 2015

I still haven't been able to get PXE working with UEFI, but I can verify that upgrading works fine if I make the root volume a single 2TB drive.

@wkruse

wkruse commented Nov 23, 2015

I am experiencing pretty much the same issue as @treed on a Dell PowerEdge R630 with a PERC H730 Mini (Embedded) Integrated RAID Controller, configured as RAID 0 (don't ask) with a single 3.7 TB disk. No UEFI. PXE boot and install to disk of 766.4.0 works without problems. Updating to 766.5.0 (via update-engine or manually) breaks it.

sudo cgpt show /dev/sda
       start        size    part  contents
           0           1          Unknown
           1           1 INVALID  Pri GPT header
           2          32 INVALID  Pri GPT table
  7807959007          32          Sec GPT table
        4096      262144       1  Label: "EFI-SYSTEM"
                                  Type: EFI System Partition
                                  UUID: 826ED773-DC1E-4214-AE22-95F37F00BA41
                                  Attr: Legacy BIOS Bootable
      266240        4096       2  Label: "BIOS-BOOT"
                                  Type: BIOS Boot Partition
                                  UUID: ACA99593-EC92-47C9-B513-E0E323A7D0B2
      270336     2097152       3  Label: "USR-A"
                                  Type: Alias for coreos-rootfs
                                  UUID: 7130C94A-213A-4E5A-8E26-6CCE9662F132
                                  Attr: priority=1 tries=0 successful=1
     2367488     2097152       4  Label: "USR-B"
                                  Type: Alias for coreos-rootfs
                                  UUID: E03DD35C-7C2D-4A47-B3FE-27F15780A57C
                                  Attr: priority=2 tries=1 successful=0
     4464640      262144       6  Label: "OEM"
                                  Type: Alias for linux-data
                                  UUID: 2B61E089-03FF-4CE7-A9DE-06560DD3A323
     4726784      131072       7  Label: "OEM-CONFIG"
                                  Type: CoreOS reserved
                                  UUID: 2AD50847-5108-439A-81FE-4A3EF33977DD
     4857856  7803101151       9  Label: "ROOT"
                                  Type: CoreOS auto-resize
                                  UUID: 2BA5EFDF-B69C-45AC-9DD6-6C55BC5D9941
  7807959039           1          Sec GPT header

WARNING: one of the GPT header/entries is invalid, please run 'cgpt repair'

Booting PXE and repairing doesn't help. Current "workaround" is to disable updates and reboots.
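One pattern across the reports in this thread is volume size: the failing volumes (~2.5 TB, 3.7 TB, 5.5 TB) are all larger than 2 TiB, while the working ones (under 1 TB, a single 2 TB drive) are not. With 512-byte sectors, 2 TiB is exactly where sector addresses stop fitting in 32 bits. The Python sketch below just does that arithmetic using the last LBA from the cgpt output above. Whether a 32-bit sector address in the legacy BIOS write path is actually the cause is an assumption, not something confirmed here, and the 512-byte sector size is assumed as well.

```python
SECTOR_SIZE = 512        # assumed logical sector size in bytes
MAX_LBA_32 = 2**32 - 1   # largest sector address that fits in 32 bits


def exceeds_32bit_lba(last_lba):
    """Return True if the disk has sectors that a 32-bit LBA cannot address."""
    return last_lba > MAX_LBA_32


def size_tib(last_lba, sector_size=SECTOR_SIZE):
    """Disk size in TiB, given the address of its last sector."""
    return (last_lba + 1) * sector_size / 2**40


# Last LBA taken from the "Sec GPT header" row of the cgpt output above.
reported_last_lba = 7807959039
print(exceeds_32bit_lba(reported_last_lba), round(size_tib(reported_last_lba), 2))

# A nominal 2 TB (2 * 10**12 bytes) drive, for comparison.
two_tb_last_lba = 2 * 10**12 // SECTOR_SIZE - 1
print(exceeds_32bit_lba(two_tb_last_lba), round(size_tib(two_tb_last_lba), 2))
```

If this does turn out to matter, it would fit with the workaround of keeping the CoreOS volume small, since every GPT structure on such a volume, including the backup header at the end of the disk, stays within 32-bit addressing.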

@wkruse

wkruse commented Dec 7, 2015

835.8.0 with UEFI works like a charm.

@crawford
Contributor

@treed were you ever able to get UEFI working? I'm curious if that is enough to allow larger RAID arrays.

@treed
Author

treed commented Jan 26, 2016

It's been a while, but I don't think so. I ended up redoing my RAID so that I had a single-disk root volume and a 3-disk RAID-5 for /var/lib/docker.


@crawford
Contributor

OK. I'm going to close this one since it looks like UEFI works. Feel free to re-open it if you run into trouble again.

@eweidner

I just ran into this issue on a Dell R720 after an upgrade to 899.17.0 (Stable) with a 4.7 TB hardware RAID partition and legacy BIOS. The issue doesn't seem to be recoverable. Luckily I was still in testing, so I just rebuilt with UEFI turned on and also repartitioned to have a smaller main volume, just in case. I recommend listing this in the known issues somewhere in the bare-metal install instructions.

Thanks,

Eric
