New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GRUB memory corruption in older Container Linux releases #2400

Closed
bgilbert opened this Issue Apr 10, 2018 · 6 comments

Comments

Projects
None yet
2 participants
@bgilbert
Member

bgilbert commented Apr 10, 2018

Issue Report

Bug

Container Linux Version

Introduced in coreos/grub@a2599ab in 926.0.0.
Fixed in coreos/grub#53 in alpha 1744.0.0.

Environment

AMD64 booting via BIOS.

Expected Behavior

GRUB does not corrupt memory.

Actual Behavior

GRUB corrupts memory. Sometimes this causes boot failures.

Reproduction Steps

See #2284.

Other Information

Because GRUB is not updated after installation, OS upgrades will not fix the bug. The open question is whether the bug is likely to cause machines to break in the future. We have no evidence that this has happened so far, but we don't really know.

Further characterize the potential impact of the bug on existing systems, and consider adding a coreos-postinst hack to fix the broken GRUB on upgrade. This could be done by installing a fixed GRUB version (seems very risky) or by applying a binary patch to fix the incorrect offset (seems somewhat less risky).

@ajeddeloh

This comment has been minimized.

ajeddeloh commented Apr 10, 2018

Capturing some offline discussion and experimentation:

Binary patching would involve:

  • gunzip'ing /boot/coreos/grub/i386-pc/linux.mod
  • calculating &linux_params + sizeof(lh) (the wrong value) in the extracted linux file
  • finding the mov $eax <calculated value> instruction (and verifying it's the only one) right before the call to grub_memmove()
  • calculating the correct value
  • overwriting the instruction
  • gzip'ing linux back to linux.mod

This should not be done client side. If we do binary patch, we should build and thoroughly test it with each released version of grub, then the client side should apply the correct update based on either the original install version of a hash of linux.mod to verify the update is supported.

To find the incorrect and correct values, you can download the binpkg for grub and use the debug symbols in <binpkgroot>/usr/lib/grub/linux.module to find the location of linux_params, its size, and the size of lh.

Another option is to nop-out the call to grub_memmove since it's not doing anything helpful right now, but I haven't explored this in detail yet,

It's also worth noting the binkpkg should be the one for the SDK, since we use the SDK's grub, not the board's.

@ajeddeloh

This comment has been minimized.

ajeddeloh commented Apr 13, 2018

I went ahead and downloaded all the binpkgs from 926 - 1722 (183 releases) and did some analysis:

  • All affected versions have debug symbols adequate for calculating the incorrect and correct addresses
    • Some versions have multiple grub binkpgs (why?). In some (all? I haven't checked) of these, only 1 of the binpkgs have the buggy code
  • linux.module always has only one mov $eax <bad addr> instruction. Manual inspection of the instruction and the ones after it confirm it is always the one we want.
  • There are 6 unique variants on the bad instruction, but ~60 different linux.module's in the binpkgs. There may be fewer unique /boot/grub/i386-pc/linux.mod's on the filesystem.
  • The linux_params struct is always the same size (0x3fc)
  • 1688.5.3 and 1688.4.0 have the same bad instruction at the same location (i.e. it's not a difference in the instruction that caused 1688.4.0 to fail without entering the menu editor). @bgilbert was curious about this
@ajeddeloh

This comment has been minimized.

ajeddeloh commented Apr 17, 2018

Things needed before binary patching:

  • Determine if the behavior change is breaking, and to what degree; decide if it should be nop'd out or if the operand should be fixed
    • It's not - currently the memmove isn't strictly needed, but future versions of the linux 32 bit boot protocol might require it. Right now, it can be nop'd out or fixed and neither will change behavior of existing systems.
  • Find the bad instuctions in every linux.mod, where they are and what they should be patched to
    • What they should be patched to assuming the operand should be fixed
    • What they should be patched to assuming nop'ing out the call to grub_memmove
  • Have an "esacpe hatch" for disabling the behavior so if a user reverts the change it wont re-apply next update
  • Write test for applying the script
    • Test on all supported platforms for all original install versions with the bug
      • aws
      • do Can't spin up old instances
      • gce (can only go back as far as 1122.0.0
      • azure
      • qemu
  • Determine why this only hits certain builds

ajeddeloh added a commit to ajeddeloh/coreos-overlay that referenced this issue May 3, 2018

coreos-base/update_engine: bump to 0.4.7
This includes a postinst update to fix broken grubs. See
coreos/bugs#2400.
@ajeddeloh

This comment has been minimized.

ajeddeloh commented May 3, 2018

@ajeddeloh ajeddeloh closed this May 3, 2018

@ajeddeloh

This comment has been minimized.

ajeddeloh commented May 3, 2018

Reopening until it makes its way to stable.

@bgilbert bgilbert reopened this May 4, 2018

@bgilbert

This comment has been minimized.

Member

bgilbert commented May 11, 2018

We're not planning to backport this to the current beta or stable branches; we'll let the change promote through the channels in the usual fashion. GRUB itself is fixed in all three release branches (though the fix is not yet included in a stable release) so there will be no new affected releases. Thus I'll close.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment