Skip to content
This repository has been archived by the owner. It is now read-only.

GRUB memory corruption in older Container Linux releases #2400

bgilbert opened this issue Apr 10, 2018 · 6 comments

GRUB memory corruption in older Container Linux releases #2400

bgilbert opened this issue Apr 10, 2018 · 6 comments


Copy link

@bgilbert bgilbert commented Apr 10, 2018

Issue Report


Container Linux Version

Introduced in coreos/grub@a2599ab in 926.0.0.
Fixed in coreos/grub#53 in alpha 1744.0.0.


AMD64 booting via BIOS.

Expected Behavior

GRUB does not corrupt memory.

Actual Behavior

GRUB corrupts memory. Sometimes this causes boot failures.

Reproduction Steps

See #2284.

Other Information

Because GRUB is not updated after installation, OS upgrades will not fix the bug. The open question is whether the bug is likely to cause machines to break in the future. We have no evidence that this has happened so far, but we don't really know.

Further characterize the potential impact of the bug on existing systems, and consider adding a coreos-postinst hack to fix the broken GRUB on upgrade. This could be done by installing a fixed GRUB version (seems very risky) or by applying a binary patch to fix the incorrect offset (seems somewhat less risky).

Copy link

@ajeddeloh ajeddeloh commented Apr 10, 2018

Capturing some offline discussion and experimentation:

Binary patching would involve:

  • gunzip'ing /boot/coreos/grub/i386-pc/linux.mod
  • calculating &linux_params + sizeof(lh) (the wrong value) in the extracted linux file
  • finding the mov $eax <calculated value> instruction (and verifying it's the only one) right before the call to grub_memmove()
  • calculating the correct value
  • overwriting the instruction
  • gzip'ing linux back to linux.mod

This should not be done client side. If we do binary patch, we should build and thoroughly test it with each released version of grub, then the client side should apply the correct update based on either the original install version of a hash of linux.mod to verify the update is supported.

To find the incorrect and correct values, you can download the binpkg for grub and use the debug symbols in <binpkgroot>/usr/lib/grub/linux.module to find the location of linux_params, its size, and the size of lh.

Another option is to nop-out the call to grub_memmove since it's not doing anything helpful right now, but I haven't explored this in detail yet,

It's also worth noting the binkpkg should be the one for the SDK, since we use the SDK's grub, not the board's.

Copy link

@ajeddeloh ajeddeloh commented Apr 13, 2018

I went ahead and downloaded all the binpkgs from 926 - 1722 (183 releases) and did some analysis:

  • All affected versions have debug symbols adequate for calculating the incorrect and correct addresses
    • Some versions have multiple grub binkpgs (why?). In some (all? I haven't checked) of these, only 1 of the binpkgs have the buggy code
  • linux.module always has only one mov $eax <bad addr> instruction. Manual inspection of the instruction and the ones after it confirm it is always the one we want.
  • There are 6 unique variants on the bad instruction, but ~60 different linux.module's in the binpkgs. There may be fewer unique /boot/grub/i386-pc/linux.mod's on the filesystem.
  • The linux_params struct is always the same size (0x3fc)
  • 1688.5.3 and 1688.4.0 have the same bad instruction at the same location (i.e. it's not a difference in the instruction that caused 1688.4.0 to fail without entering the menu editor). @bgilbert was curious about this
Copy link

@ajeddeloh ajeddeloh commented Apr 17, 2018

Things needed before binary patching:

  • Determine if the behavior change is breaking, and to what degree; decide if it should be nop'd out or if the operand should be fixed
    • It's not - currently the memmove isn't strictly needed, but future versions of the linux 32 bit boot protocol might require it. Right now, it can be nop'd out or fixed and neither will change behavior of existing systems.
  • Find the bad instuctions in every linux.mod, where they are and what they should be patched to
    • What they should be patched to assuming the operand should be fixed
    • What they should be patched to assuming nop'ing out the call to grub_memmove
  • Have an "esacpe hatch" for disabling the behavior so if a user reverts the change it wont re-apply next update
  • Write test for applying the script
    • Test on all supported platforms for all original install versions with the bug
      • aws
      • do Can't spin up old instances
      • gce (can only go back as far as 1122.0.0
      • azure
      • qemu
  • Determine why this only hits certain builds
ajeddeloh added a commit to ajeddeloh/coreos-overlay that referenced this issue May 3, 2018
This includes a postinst update to fix broken grubs. See
Copy link

@ajeddeloh ajeddeloh commented May 3, 2018

@ajeddeloh ajeddeloh closed this May 3, 2018
Copy link

@ajeddeloh ajeddeloh commented May 3, 2018

Reopening until it makes its way to stable.

@bgilbert bgilbert reopened this May 4, 2018
Copy link
Member Author

@bgilbert bgilbert commented May 11, 2018

We're not planning to backport this to the current beta or stable branches; we'll let the change promote through the channels in the usual fashion. GRUB itself is fixed in all three release branches (though the fix is not yet included in a stable release) so there will be no new affected releases. Thus I'll close.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants