Update from 1000.0.0 to 1010.1: first boot OK, then crash. But direct install of 1010.1 is OK (MBR corruption after update) #1238
CoreOS is installed on an HP MicroServer Gen8, on an SSD in bay 5, with the RAID controller in B120i mode and one logical RAID0 volume containing a single disk (the SSD in bay 5). This is a common trick to boot a MicroServer Gen8 from an internal drive and leave the four 3.5" bays for data disks.
Steps to reproduce the problem:
The "second boot always broken" seems similar to #1218.
How long was it between the first reboot and the second reboot? It looks like writing the successful flag to USR-B after the first boot didn't happen; update_engine waits 45 seconds after boot before doing that. I would generally assume the risky operation actually happens during that first reboot, when the bootloader updates the table to decrement the tries counter on USR-B prior to booting it. The second reboot shouldn't have done anything much different from the fresh install: using USR-A with 1000 on it, since USR-B had neither a successful flag nor any tries left. So I'm pretty confused about what is going on. Maybe there is some caching issue caused by writing to the partition table in the bootloader during the first reboot that doesn't crop up and cause issues until the second. Perhaps dumping the table between the first and second reboot would offer some insight.
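For readers unfamiliar with the A/B scheme being discussed: the successful/tries/priority state lives in the GPT partition attribute bits, which is what both update_engine and the bootloader are manipulating here. A sketch (not from the issue) of decoding those bits, using the ChromeOS-style layout CoreOS inherited (priority in bits 48-51, tries in 52-55, successful in bit 56 of the 64-bit attribute field at byte offset 48 of each 128-byte GPT entry). It runs against a synthetic entry rather than the real /dev/sda:

```shell
# Build a fake 128-byte GPT partition entry in a temp file.
entry=$(mktemp)
dd if=/dev/zero of="$entry" bs=1 count=128 2>/dev/null

# Fabricate an attribute field of 0x0021000000000000
# (priority=1, tries=2, successful=0), little-endian on disk.
printf '\x00\x00\x00\x00\x00\x00\x21\x00' |
  dd of="$entry" bs=1 seek=48 conv=notrunc 2>/dev/null

# Read the field back as one 64-bit value (od on x86 is little-endian).
attrs=0x$(od -An -tx8 -j48 -N8 "$entry" | tr -d ' ')
priority=$((   (attrs >> 48) & 0xF ))
tries=$((      (attrs >> 52) & 0xF ))
successful=$(( (attrs >> 56) & 0x1 ))
echo "priority=$priority tries=$tries successful=$successful"
```

With tries=2 and successful=0 as above, the bootloader would decrement tries and attempt the partition; once tries hits 0 without the successful bit being set, it falls back, which is why the second boot was expected to land back on USR-A.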
Hello @marineam , new test as requested:
Same test, starting from Alpha 991.0.0
Ok, so this is interesting:
Booting the fresh install is properly fixing up the secondary GPT as expected. (In the initial image written to disk it is invalid, being located at the end of the original image size instead of the new larger disk).
But the reboot after applying the update is corrupting it again.
We have scattered a few
Unfortunately the exact manner of corruption isn't clear from the prettified cgpt output, and I'll need to dig through the GRUB code to figure out what may be going on. After a reboot, could you capture the raw partition tables? Something like:
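The command block itself isn't preserved above, but later comments show the captures were plain `dd` dumps of the sectors holding the MBR and both GPT copies. A sketch of that kind of capture, with a scratch image standing in for the real /dev/sda (the `start.bin`/`end.bin` names are illustrative, chosen to match the `start-3.bin` naming seen later in the thread):

```shell
# Scratch image standing in for /dev/sda (1MB = 2048 sectors).
disk=$(mktemp)
dd if=/dev/zero of="$disk" bs=512 count=2048 2>/dev/null

# MBR + primary GPT header + 32 sectors of partition entries: first 34 sectors.
dd if="$disk" of=start.bin bs=512 count=34 2>/dev/null

# Backup partition entries + backup GPT header: last 33 sectors.
total=$(( $(stat -c%s "$disk") / 512 ))
dd if="$disk" of=end.bin bs=512 skip=$(( total - 33 )) count=33 2>/dev/null

ls -l start.bin end.bin
```

Comparing such dumps taken before and after each reboot is what localizes the corruption to a specific on-disk structure.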
As a workaround, after the reboot you should be able to do the following to get a system that will continue working:
Alternatively, if you boot the installed system in UEFI mode instead of legacy BIOS mode you may get different results (at least, I'm assuming you are currently using BIOS mode).
Also, mostly for my own reference so I don't forget, EIP 0x7c64 corresponds to the
So I don't yet know how the system might come to try to execute that data byte.
Hello @marineam, many thanks for your support.
I collected requested disk sectors at several stages:
@marineam thanks for your effort, I really appreciate it.
New test with only SSD boot drive (/dev/sda) in SATA port 5, managed as a single disk RAID0 logical volume by the HP B120i controller.
Made this test to be sure the bootloader (grub) is not confused by the presence of additional drives not managed by the HP B120i controller.
Anyway, the second boot of the updated system results in a Red Screen of Death (with the same register values).
Attaching the requested disk sectors:
The partition table:
Last test for today: removed SSD in SATA port 5, put HDD in SATA port 1, configured controller in AHCI mode.
After update the system WORKS!
So this definitely looks like a B120i RAID controller issue to me.
One of the search results is #125.
This time I have no disk sectors, sorry!
Update_engine log LGTM this time:
Crossing fingers and rebooting...
For the original data dump you posted here is diff of where the MBR gets corrupted during the reboot:
```diff
--- mbr-3.asm	2016-04-24 11:53:41.750425052 -0700
+++ mbr-4.asm	2016-04-24 11:53:41.753758229 -0700
@@ -5,10 +5,43 @@
 Disassembly of section .data:

 00000000 <.data>:
-   0:	eb 63                	jmp    0x65
-   2:	90                   	nop
+   0:	78 50                	js     0x52
+   2:	23 00                	and    (%bx,%si),%ax
+   4:	00 00                	add    %al,(%bx,%si)
+   6:	00 00                	add    %al,(%bx,%si)
+   8:	78 50                	js     0x5a
+   a:	03 00                	add    (%bx,%si),%ax
+   c:	5c                   	pop    %sp
+   d:	00 00                	add    %al,(%bx,%si)
+   f:	00 3d                	add    %bh,(%di)
+  11:	9e                   	sahf
+  12:	35 e2 00             	xor    $0xe2,%ax
+	...
+  1d:	00 00                	add    %al,(%bx,%si)
+  1f:	00 08                	add    %cl,(%bx,%si)
+	...
+  2d:	00 00                	add    %al,(%bx,%si)
+  2f:	00 b7 00 00          	add    %dh,0x0(%bx)
+  33:	00 01                	add    %al,(%bx,%di)
+  35:	00 00                	add    %al,(%bx,%si)
+  37:	00 03                	add    %al,(%bp,%di)
+  39:	00 00                	add    %al,(%bx,%si)
+  3b:	00 00                	add    %al,(%bx,%si)
+  3d:	00 00                	add    %al,(%bx,%si)
+  3f:	00 80 50 23          	add    %al,0x2350(%bx,%si)
+  43:	00 00                	add    %al,(%bx,%si)
+  45:	00 00                	add    %al,(%bx,%si)
+  47:	00 80 50 03          	add    %al,0x350(%bx,%si)
+  4b:	00 00                	add    %al,(%bx,%si)
+  4d:	00 00                	add    %al,(%bx,%si)
+  4f:	00 08                	add    %cl,(%bx,%si)
+  51:	4d                   	dec    %bp
+  52:	00 00                	add    %al,(%bx,%si)
+  54:	00 00                	add    %al,(%bx,%si)
+  56:	00 00                	add    %al,(%bx,%si)
+  58:	1d a9 11             	sbb    $0x11a9,%ax
+  5b:	49                   	dec    %cx
+  5c:	00 10                	add    %dl,(%bx,%si)
-  5b:	80 00 10             	addb   $0x10,(%bx,%si)
   5e:	04 00                	add    $0x0,%al
   60:	00 00                	add    %al,(%bx,%si)
   62:	00 00                	add    %al,(%bx,%si)
```
The key point being that the all-critical `jmp` instruction at offset 0 has been overwritten with data, so the CPU starts executing whatever bytes landed there instead of the boot code.
Hello @marineam I made another test and fixed the system, thanks to your suggestions:
During the first boot of 1010.1.0, after cgpt ran (45 seconds after boot), I did:
Then I net-booted with PXE (maybe not necessary) and used dd to replace the corrupted MBR with the good one from start-3.bin (its first 512 bytes).
This fixed the system, and subsequent reboots of 1010.1.0 are stable, with the MBR left unchanged on every reboot (verified by taking an MD5 of the first 512 bytes at each reboot).
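The MD5 stability check described here can be sketched as follows (a scratch image stands in for /dev/sda; on the real system the two hashes are taken on either side of a reboot):

```shell
# Scratch image standing in for the boot disk.
disk=$(mktemp)
dd if=/dev/zero of="$disk" bs=512 count=4 2>/dev/null

# Hash the first 512 bytes (the MBR).
before=$(dd if="$disk" bs=512 count=1 2>/dev/null | md5sum | cut -d' ' -f1)
# ... on the real system, the reboot happens here ...
after=$(dd if="$disk" bs=512 count=1 2>/dev/null | md5sum | cut -d' ' -f1)

if [ "$before" = "$after" ]; then
  echo "MBR unchanged"
else
  echo "MBR corrupted, restore from backup"
fi
```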
So, to recap:
Also, I've found #159. Could it be related in some way?
As a temporary fix I'll write some dependency units for the update process, in order to fix the partition table and restore MBR in case of corruption.
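A minimal sketch of what such a dependency unit could look like, assuming a known-good MBR copy is saved at /opt/mbr-good.bin (a hypothetical path, not from the thread; the unit wiring is likewise an assumption, not the author's actual units). Written to a temporary directory here rather than /etc/systemd/system:

```shell
unitdir=$(mktemp -d)
cat > "$unitdir/mbr-restore.service" <<'EOF'
[Unit]
Description=Restore known-good MBR if the bootloader corrupted it
Before=update-engine.service

[Service]
Type=oneshot
# Compare the first 512 bytes against the saved copy; rewrite on mismatch.
ExecStart=/bin/sh -c 'cmp -s -n 512 /opt/mbr-good.bin /dev/sda || dd if=/opt/mbr-good.bin of=/dev/sda bs=512 count=1 conv=notrunc'

[Install]
WantedBy=multi-user.target
EOF
echo "wrote $unitdir/mbr-restore.service"
```

Ordering the unit before update-engine.service means the MBR is known-good before each update attempt; a companion unit could run the same check after the post-update reboot.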
Just a confirmation of the hypothesis of MBR corruption during the boot of the updated system.
mbr-3.bin: good MBR in system 1000.0.0
The system is again repaired with:
and the next reboot is ok.
Status update on this: after a long series of improvements and bug fixes to our GPT code in GRUB, I've finally posted the one that fixes booting on this particular system/configuration. There were two issues. The first, which I fixed a few weeks ago, was that the GPT code didn't properly repair invalid data, leading it to potentially corrupt arbitrary portions of the disk. The second issue was the odd firmware configuration of this system that led to reading invalid data in the first place. This fix should be a suitable workaround: coreos/grub#39
Configuring a single-disk RAID 0 array on this system reserves 32MB or more at the end of the disk. So GRUB, using the BIOS interface, thinks the backup GPT should be in a different location and is unable to access the real one. Meanwhile the OS is writing the backup GPT, and potentially filesystem data, to a portion of the disk the firmware thinks is reserved for the RAID array. I didn't see any sign of the firmware attempting to write to this area, which seems plausible since the proprietary kernel driver is probably just implementing software RAID, but it's hard to know whether it can be trusted to never read or write there.
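To make the mismatch concrete, here is the arithmetic with an illustrative disk size (the 32MB figure comes from the comment above; the sector count is an assumption, not the reporter's actual SSD):

```shell
real_sectors=250069680                    # illustrative SSD size in 512-byte sectors
reserved=$(( 32 * 1024 * 1024 / 512 ))    # 32MB reserved by the B120i = 65536 sectors
bios_sectors=$(( real_sectors - reserved ))

os_backup_lba=$(( real_sectors - 1 ))     # where the OS writes the backup GPT header
grub_backup_lba=$(( bios_sectors - 1 ))   # where GRUB, via the BIOS-reported size, expects it
echo "OS writes backup header at LBA  $os_backup_lba"
echo "GRUB looks for backup header at $grub_backup_lba"
echo "disagreement: $(( os_backup_lba - grub_backup_lba )) sectors"
```

Because the two locations disagree by the full reserved region, GRUB reads garbage where it expects a backup header, which is exactly the "invalid data" the fixed repair code now has to handle safely.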
So future CoreOS releases should work with this setup, but I can't say I really recommend it.