
grub fails with message "free magic broken" or "alloc magic broken" #2284

Closed
ajeddeloh opened this Issue Dec 9, 2017 · 28 comments


ajeddeloh commented Dec 9, 2017

Issue Report

Bug

Container Linux Version

1618.0.0 and newer

Environment

qemu, possibly others

Expected Behavior

Appending to the kernel command line via editing the grub menu does not prevent the system from booting.

Actual Behavior

Appending to the kernel command line prevents the system from booting. grub exits with the error message: free magic is broken at 0x3cec8166: 0xcfc53e26 (actual addresses vary)

Reproduction Steps

  1. Build a Container Linux image
  2. Run the generated coreos_production_qemu.sh script
  3. When the grub menu appears, edit the first entry and append something to the linux command line (e.g. "aaaaaaaaaaaaaaaaaaaaa")
  4. Press ^X to boot
  5. Observe the "free magic is broken at..." message

Other Information

This was introduced with changing /boot to fat32. It may be a bug in our image creation or in grub itself.

It is present with our current alpha images, and in master. Sometimes the error message is repeated, especially when building off master, although that is probably coincidence.


ajeddeloh commented Dec 9, 2017

Reviewed the dosfstools changelog; nothing jumped out at me as a culprit.

I mounted the boot partitions from an affected and an unaffected image:

# unaffected
$ sudo blkid -p /dev/loop1p1 | tr ' ' '\n'
/dev/loop1p1:
SEC_TYPE="msdos"
LABEL="EFI-SYSTEM"
UUID="D4CB-655D"
VERSION="FAT16"
TYPE="vfat"
USAGE="filesystem"
PART_ENTRY_SCHEME="gpt"
PART_ENTRY_NAME="EFI-SYSTEM"
PART_ENTRY_UUID="35606fa3-ee78-4adb-8b3e-4ac9297c1a91"
PART_ENTRY_TYPE="c12a7328-f81f-11d2-ba4b-00a0c93ec93b"
PART_ENTRY_FLAGS="0x4"
PART_ENTRY_NUMBER="1"
PART_ENTRY_OFFSET="4096"
PART_ENTRY_SIZE="262144"
PART_ENTRY_DISK="7:1"
# affected
$ sudo blkid -p /dev/loop0p1 | tr ' ' '\n'
/dev/loop0p1:
LABEL="EFI-SYSTEM"
UUID="F3A1-D2E7"
VERSION="FAT32"
TYPE="vfat"
USAGE="filesystem"
PART_ENTRY_SCHEME="gpt"
PART_ENTRY_NAME="EFI-SYSTEM"
PART_ENTRY_UUID="b29a7470-e2de-4f17-b3af-f90fa7767608"
PART_ENTRY_TYPE="c12a7328-f81f-11d2-ba4b-00a0c93ec93b"
PART_ENTRY_FLAGS="0x4"
PART_ENTRY_NUMBER="1"
PART_ENTRY_OFFSET="4096"
PART_ENTRY_SIZE="262144"
PART_ENTRY_DISK="7:0"
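The two dumps above differ only in a handful of fields. A small helper makes the comparison mechanical; this is just an illustrative sketch over abbreviated, inlined copies of the blkid output, not part of the actual investigation:

```python
# Field-by-field diff of the two blkid dumps above. The parser just splits
# KEY="VALUE" lines as produced by `blkid -p ... | tr ' ' '\n'`.

def parse_blkid(text):
    """Turn blkid key=value output into a dict of fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key] = value.strip('"')
    return fields

unaffected = '''
SEC_TYPE="msdos"
LABEL="EFI-SYSTEM"
UUID="D4CB-655D"
VERSION="FAT16"
TYPE="vfat"
'''

affected = '''
LABEL="EFI-SYSTEM"
UUID="F3A1-D2E7"
VERSION="FAT32"
TYPE="vfat"
'''

good, bad = parse_blkid(unaffected), parse_blkid(affected)
for key in sorted(set(good) | set(bad)):
    if good.get(key) != bad.get(key):
        print(key, good.get(key), "->", bad.get(key))
# Only SEC_TYPE, UUID and VERSION differ; both are plain vfat filesystems.
```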

ajeddeloh commented Dec 9, 2017

Log from booting alpha. I went into the grub command line and set debug=all.

log.txt

Edit: built new images today (qemu, qemu_uefi); they did not trigger this bug. Tested with the qemu_uefi platform from current alpha and from the same image that failed with the qemu platform (1618.0.0) and it also did not trigger the bug. The normal qemu platform still does.

ajeddeloh added a commit to ajeddeloh/scripts-1 that referenced this issue Dec 18, 2017

Revert "disk_util: use FAT32 on ESP"
This reverts commit 7f058d6.

Reverting because of bug 2284 [1] where grub will sometimes fail due to
memory corruption. This is _not_ the cause of the bug, and the bug can
even be reproduced with this reversion, but it seems to occur less when
not using fat32.

[1] coreos/bugs#2284


ajeddeloh commented Dec 18, 2017

After about a week of digging I still haven't been able to find the problem. Here is what I did find:

  • grub crashes with bad free magic or bad alloc magic
  • The bug is present in all CL channels
  • It only occurs when bios booting (although this is not heavily tested)
  • It does not matter if /boot is fat16 or fat32
  • A given image/build will succeed or fail deterministically
  • Running build_image multiple times is deterministic: if the state of your repos produces a succeeding build, it will always succeed; if it produces a failing build, it will always fail.
  • I do not know what makes a given build/image succeed or fail
  • images with the cmdline appended via kola mkimage work
  • On some images just entering the menu editor then booting (via ^X) is enough to trigger the bug. On other images you need to append to the menu item. Some images require several lines of text appended.
  • running grub_install.sh --target i386-pc --disk_image <path-to-image> fixes it
    • rerunning just the i386-pc target works
    • rerunning all the targets works
    • rerunning all targets except i386-pc does not work; i386-pc MUST be run
    • grub_install.sh can be run directly on the image or on a loop-mounted image
    • Rerunning grub_install.sh only changes the boot partition. All other partitions, the boot sector and the embedding area (sectors 34-4095) are unchanged.
    • Rerunning grub_install.sh produces slightly different boot partition contents each time, probably due to timestamps.
  • reformatting /boot without rerunning grub-install causes it to fail to boot with "invalid gpt signature"
    • The grub config that gets loaded first looks for the uuid of the boot partition as originally created. This can be changed to search.fs_label for debugging purposes
    • If grub_install.sh is modified to use labels instead of uuids, then reformatting /boot and copying the files back works.
  • fsck.fat sees no errors
  • dosfstools 4.1 and 3.x behave the same
  • testdisk claims the fat partitions are bad, but using testdisk to fix them has no effect
  • the stack is not colliding with the heap (edited debug messages to print an approximate sp)
  • editing files on a broken /boot has no effect
  • all of our modules (gptprio, efilinux, search_part_label, getenv, smbios) can be disabled in grub_install and the system will still boot, since they can be dynamically loaded from /boot
  • I have not been able to reproduce the issue with vanilla grub. Testing this requires removing all coreos modules in the CORE_MODULES list in grub_install.sh and always booting from USR-A (since the default menu entry uses gptprio)
  • Running git-bisect was not productive since there are sometimes builds that do not trigger the bug but still "carry" the bug.

glevand commented Dec 19, 2017

@ajeddeloh That error comes from grub's memory manager code: https://github.com/coreos/grub/blob/master/grub-core/kern/mm.c#L205, so the problem seems to be memory corruption of the grub_mm_header_t.

A web search for "free magic is broken" gives a lot of hits. I did find this one fix: http://lists.gnu.org/archive/html/grub-devel/2008-01/msg00830.html.

I suggest you put a hardware write breakpoint on the grub_mm_header_t.magic address and see if you can figure out what is writing to it. You may need to find a qemu cpu emulation that supports hardware write breakpoints.
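The shape of the check that fires can be illustrated with a toy model: grub's allocator tags every heap block header with a magic word, and freeing walks headers it expects to carry the "free" magic; anything else means something scribbled over the header. The magic constants below are placeholders, not GRUB's real values, and the classes are invented for the sketch:

```python
# Toy model of GRUB's heap bookkeeping (kern/mm.c), to illustrate why
# "free magic is broken" indicates corruption of a grub_mm_header_t.
# Magic values here are illustrative placeholders, not GRUB's real ones.

FREE_MAGIC = 0xF4EE0001   # placeholder for GRUB_MM_FREE_MAGIC
ALLOC_MAGIC = 0xA110C001  # placeholder for GRUB_MM_ALLOC_MAGIC

class Header:
    """Stands in for grub_mm_header_t: a magic word plus the block size."""
    def __init__(self, magic, size):
        self.magic = magic
        self.size = size

def check_free(addr, header):
    # When walking the free list, a header that should describe a free
    # block must carry FREE_MAGIC; otherwise the heap metadata was
    # overwritten by a stray write.
    if header.magic != FREE_MAGIC:
        raise RuntimeError(
            "free magic is broken at 0x%x: 0x%x" % (addr, header.magic))

h = Header(FREE_MAGIC, 32)
check_free(0x3cec8166, h)   # intact header: passes silently
h.magic = 0xcfc53e26        # simulate a stray write over the header
try:
    check_free(0x3cec8166, h)
except RuntimeError as e:
    print(e)                # mirrors the message seen at boot
```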


ajeddeloh commented Dec 22, 2017

Thanks for the suggestion, I'll give that a try after the holidays.


ajeddeloh commented Mar 28, 2018

This is currently affecting Container Linux 1688.4.0 Stable. If you encounter it, revert to 1632.3.0 Stable. We are working to release a new stable version that works around this issue.


andor44 commented Mar 28, 2018

Are the update servers disabled because of this by any chance? I created a new CL VM today from an older image and trying to force an update always results in the following:

$ update_engine_client -update
I0328 11:32:33.542951   913 update_engine_client.cc:247] Initiating update check and install.
I0328 11:32:33.544214   913 update_engine_client.cc:252] Waiting for update to complete.
LAST_CHECKED_TIME=1522236755
PROGRESS=0.000000
CURRENT_OP=UPDATE_STATUS_IDLE
NEW_VERSION=0.0.0
NEW_SIZE=0
E0328 11:32:38.927026   913 update_engine_client.cc:190] Update failed.

On CL 1520.8.0.


bgilbert commented Mar 28, 2018

ajeddeloh added a commit to ajeddeloh/scripts-1 that referenced this issue Mar 29, 2018


bgilbert commented Mar 30, 2018

For updates on the status of this bug in the stable channel, see the coreos-user thread.

@andor44 Updates are now re-enabled and pointing at 1632.3.0. Machines that are already running 1688.4.0 will not automatically downgrade.


ajeddeloh commented Apr 3, 2018

Some info for debugging:

  • You can fetch the 1618.0.0 release with cork download-image; the bug can be reliably triggered in it by editing menu entries.
  • gs://builds.developer.core-os.net/sdk/amd64/1618.0.0/pkgs/sys-boot/grub-2.02-r1.tbz2 has the grub binpkg from that release
  • To debug with the coreos_production_qemu.sh script you need to disable kvm acceleration
  • use -s -S to allow gdb to connect to the qemu session
  • /usr/lib/grub/kernel.exec (from the binpkg) has debug symbols that can be loaded into gdb with symbol-file, but it doesn't have all of them.
  • diskboot.img is loaded at 0x8000 and starts executing at 0x8200. It loads kernel.img then jumps to 0x9000
  • kernel.img starts executing at 0x9000, immediately calls grub_main()
  • I haven't been able to get useful backtraces; they mostly look like garbage. It's very possible that gdb isn't correctly walking the stack (is it expecting real mode?)
  • the gdb backtrace command gets confused. If you want a real backtrace, walk the stack manually.

Getting symbols from modules:

  • unpack the binpkg somewhere
  • http://v3.sk/~lkundrak/grub2-gdb/howto.html has a .gdbinit that can be placed in <binpkgroot>/usr/lib/grub/i386/. Add architecture i386:x86-64 to the start
  • copy our grub sources to <binpkgroot>/usr/lib/grub-2.02
  • launch gdb from <binpkgroot>/usr/lib/grub/i386/
  • It will break on grub's module load and load the debug symbols into gdb based on where the grub code loaded them

bgilbert commented Apr 3, 2018

Update: Stable 1688.5.3 is rolling out now with a working build. We still need to find and address the underlying cause.


ajeddeloh commented Apr 4, 2018

I've added some code to get primitive stack traces in gdb. The issue appears to be that both gdb and qemu think they're not in protected mode, so they do things like compute backtraces with 8-byte addresses. The backtrace code is based on https://stackoverflow.com/questions/24160995/gdb-backtrace-by-walking-frame-pointers

https://gist.github.com/ajeddeloh/9b74fe9527afa614506c25f0442b056f
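The frame-pointer walk behind that approach is simple: on i386, the word at %ebp is the caller's saved %ebp, and the word just above it is the return address, so the chain can be followed until it hits a null base pointer. A minimal simulation over a fake 32-bit stack (all addresses invented):

```python
# Walking saved frame pointers to recover a backtrace, simulated over a
# fake little-endian 32-bit stack held in a bytearray.
import struct

def read_u32(mem, addr):
    return struct.unpack_from("<I", mem, addr)[0]

def backtrace(mem, ebp, max_frames=16):
    """Follow the saved-%ebp chain, collecting return addresses."""
    frames = []
    while ebp and len(frames) < max_frames:
        frames.append(read_u32(mem, ebp + 4))  # return address above saved ebp
        ebp = read_u32(mem, ebp)               # caller's frame base pointer
    return frames

# Fake stack: frame at 0x10 links to frame at 0x20, which ends the chain.
mem = bytearray(0x40)
struct.pack_into("<I", mem, 0x10, 0x20)    # inner frame's saved ebp
struct.pack_into("<I", mem, 0x14, 0x9123)  # inner frame's return address
struct.pack_into("<I", mem, 0x20, 0)       # outer frame: ebp = 0, stop
struct.pack_into("<I", mem, 0x24, 0x9001)  # outer frame's return address
print([hex(a) for a in backtrace(mem, 0x10)])  # ['0x9123', '0x9001']
```

The same loop, expressed as gdb commands reading guest memory, is what the linked script automates.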


cgwalters commented Apr 4, 2018

This was introduced with changing /boot to fat32.

The FAT32 requirement is only for EFI systems though, not BIOS right? FWIW for Fedora derivatives we currently default to either ext4 or XFS for /boot on BIOS systems - on EFI then the ESP is /boot/efi which is FAT32.

Although this seems contradicted by:

It does not matter if /boot is fat16 or fat32


ajeddeloh commented Apr 4, 2018

@cgwalters grub still accesses /boot on BIOS systems. We have the same /boot for BIOS and EFI systems. It was a coincidence that we noticed it when we switched from FAT16 to FAT32 and initially thought that was the cause. Later we discovered it actually doesn't matter, and we could trigger it on pre-switch images.

ajeddeloh changed the title from "grub fails when kernel command line is specified" to "grub fails with message 'free magic broken' or 'alloc magic broken'" Apr 4, 2018


dongsupark commented Apr 5, 2018

During releases of Flatcar Linux stable v1688.4.0, I became interested in this issue. So I tried to reproduce the bug with a Flatcar qemu image, following the steps described in this issue. Basically I ran flatcar_production_qemu.sh, went into the grub menu, edited the first entry, added an additional kernel parameter aaaaaaaaaaaaaaaaaaaa, and pressed ^X to boot.

However, I cannot reproduce the issue. I'm not seeing a message like free magic broken or alloc magic broken; the Flatcar Linux VM boots just fine.

I wondered if I needed to disable KVM acceleration as mentioned above, but that was not the case: even with KVM acceleration disabled, I still cannot reproduce it.

So I think there must be other factors that trigger this bug, so that it does not always happen. Or am I missing something?


lucab commented Apr 5, 2018

@dongsupark a given image will succeed or fail deterministically, but it is not yet clear what makes an image succeed or fail. Your current Flatcar release may be in the non-failing set now, but later releases may (or may not) experience the same bug. For Container Linux this was quite sporadic until last month, when it started to affect most of the builds.


martinezjavier commented Apr 5, 2018

@ajeddeloh I haven't set up a dev env to reproduce this issue yet, but I want to share some thoughts. I'm also not that familiar with Container Linux, so please forgive me if some of my assumptions are wrong.

  • You mentioned that it only happens when booting in BIOS mode and not on EFI; that's a good data point. So I would guess that the problem is related either to some of the operations done by grub_install.sh --target i386-pc or to a BIOS-specific GRUB 2 command (e.g. linux16).

  • You also said that running grub_install.sh several times, sometimes "fixes" it. So can you check if there's a difference in the commands executed with --target i386-pc in both cases? For example, it would be good to pass the -v option to grub-bios-setup to get more output and check if there's something different.

  • It seems getting a proper backtrace and debug symbols isn't easy, so what about adding some grub_dprintf("coreos", ...) printouts and set debug=coreos to chase down where the memory corruption happens? From your comments it seems that doesn't happen until you hit ^X, so maybe focusing on that code path?


ajeddeloh commented Apr 5, 2018

You mentioned that it only happens when booting on BIOS mode and not on EFI, that's a good data point. So I would guess that the problem should either be related to some of the operations done by grub_install.sh --target i386-pc or a BIOS specific GRUB 2 command (i.e: linux16).

Yup, investigating that now. I've got some macros that emulate the functionality of the get_header_from_pointer() function in kern/mm.c so I can inspect whether things are getting corrupted. I'm also working on tracking mallocs/frees so I can find which malloc'd block any arbitrary pointer belongs to (assuming it's heap-allocated). It's definitely overwriting the address grub complains about in grub_cmd_linux while loading data (presumably the linux kernel) off disk, so now I'm trying to figure out why, or whether that address is actually supposed to be part of a free block.

You also said that running grub_install.sh several times, sometimes "fixes" it. So can you check if there's a difference in the commands executed with --target i386-pc in both cases? For example, it would be good to pass the -v option to grub-bios-setup to get more output and check if there's something different.

I can't get a failing build locally, even checking out the build-1618 branch. When I can, I'll try this.

It seems getting a proper backtrace and debug symbols isn't easy, so what about adding some grub_dprintf("coreos", ...) printouts and set debug=coreos to chase down where the memory corruption happens? From your comments it seems that doesn't happen until you hit ^X, so maybe focusing on that code path?

I (finally) got gdb + qemu working together properly and that .gdbinit script loads the debug symbols for the modules just fine, so this isn't an issue anymore. One problem I'm still facing is that I can't get a local build to fail, so I'm stuck with the 1618.0.0 image and debug symbols from the binpkg.


vathpela commented Apr 5, 2018

I don't know how much it will help, but I've cherry-picked some patches to do better built-in backtraces into a branch based on FSF's current master branch, here: https://github.com/vathpela/grub2-fedora/tree/backtrace . It also adds a patch so that if you're running with debug=gdb it'll print the .text and .data addresses for kernel.exec and all the modules it loads, so you can use gdb with qemu -s and use the messages it prints to load local debuginfo from the build tree.

(I haven't tested this branch, just cherry picked from a branch where this all works.)


blixtra commented Apr 6, 2018

@lucab We've started to see this in subsequent builds of Flatcar. We're looking into it as well now.

ajeddeloh commented Apr 6, 2018

@vathpela thanks, at this point gdb+qemu is working pretty well. That script I linked earlier breaks on module load and loads the debug info accordingly.

@blixtra I'm assuming Flatcar's tooling is pretty similar to ours. If you modify coreos_production_qemu.sh (or whatever your equivalent is) to use qemu-system-i386 and turn off KVM acceleration, you can then use gdb with it (run set arch i386 or your backtraces won't make sense). You can then use the instructions in my earlier comment to get full debugging symbols. I've also written an (incredibly hacky) malloc/free tracker in Python + gdb macros: https://gist.github.com/ajeddeloh/16de09739e5c1c8f4671e5282eb573e3

allocs will be a map of malloc'd memory addresses to size/where it was malloc'd from. ptr_magic is a macro to take any pointer that's part of a malloc'd block and print the metadata about it.

Edit:
https://gist.github.com/ajeddeloh/8e6d33b584ca1cee9433c0e6aa16c016 is the updated version that handles grub_realloc correctly. I checked all grub_memmoves and they're all valid, barring some error in that accounting logic. Also, that makes grub take ages to boot (like 15 min).

alban commented Apr 9, 2018

@ajeddeloh I have used your GDB script but changed the watchpoint to watch *0x3fcd2d30, because that is the address the "free magic is broken at" error message reports for me.

This stack is the last stack before I get the "broken magic" error messages:

Hardware watchpoint 11: *0x3fcd2d30

Old value = 0x681efba0
New value = 0x681efbe2
0x0000d34f in tab ()
(gdb) bt
#0  0x0000d34f in tab ()
#1  0x00000018 in grub_tsc_calibrate_from_efi () at /var/tmp/portage/sys-boot/grub-2.02-r1/work/grub-2.02/grub-core/kern/i386/efi/tsc.c:32
#2  0x03ec8000 in ?? ()
#3  0x3fc38267 in ?? ()
#4  0x3ffbc628 in ?? ()
#5  0x3ffd971a in ?? ()
#6  0x3ffc617f in ?? ()
#7  0x3ffc50eb in ?? ()
#8  0x3ffc5c7c in ?? ()
#9  0x3ffc50eb in ?? ()
#10 0x3ffc61f5 in ?? ()
#11 0x3ffc6246 in ?? ()
#12 0x3ffbe54c in ?? ()
#13 0x3ffbe27a in ?? ()
#14 0x3ffbb4da in ?? ()
#15 0x3ffbb658 in ?? ()
#16 0x3ffbb6bc in ?? ()
#17 0x0000d2bb in tab ()
#18 0x00001036 in grub_efidisk_readwrite (disk=<optimized out>, sector=0x3ff98f303ff991a0, size=<optimized out>, warning: Cannot convert floating-point register value to non-floating-point type.
buf=0x0 <start>, warning: Cannot convert floating-point register value to non-floating-point type.
wr=0x0) at /var/tmp/portage/sys-boot/grub-2.02-r1/work/grub-2.02/grub-core/disk/efi/efidisk.c:567
#19 0x000000a3 in grub_tsc_calibrate_from_efi () at /var/tmp/portage/sys-boot/grub-2.02-r1/work/grub-2.02/grub-core/kern/i386/efi/tsc.c:40
#20 0x00008bca in grub_divmod64 (n=0x300007ffec, d=0x7ffec0007ffc0, r=0x194) at /var/tmp/portage/sys-boot/grub-2.02-r1/work/grub-2.02/grub-core/kern/misc.c:553
#21 0x00002874 in grub_efi_print_device_path (dp=0xe2) at /var/tmp/portage/sys-boot/grub-2.02-r1/work/grub-2.02/grub-core/kern/efi/efi.c:776
#22 0x00000019 in grub_tsc_calibrate_from_efi () at /var/tmp/portage/sys-boot/grub-2.02-r1/work/grub-2.02/grub-core/kern/i386/efi/tsc.c:32
#23 0x00000033 in grub_get_tsc () at /var/tmp/portage/sys-boot/grub-2.02-r1/work/grub-2.02/include/grub/i386/tsc.h:46
#24 grub_tsc_calibrate_from_efi () at /var/tmp/portage/sys-boot/grub-2.02-r1/work/grub-2.02/grub-core/kern/i386/efi/tsc.c:35
#25 0x3ff99230 in ?? ()
#26 0x00009041 in parse_printf_args (fmt0=<optimized out>, args=0x30, args_in=0xe2) at /var/tmp/portage/sys-boot/grub-2.02-r1/work/grub-2.02/grub-core/kern/misc.c:785
#27 0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.

In frame 18, I see:
https://github.com/coreos/grub/blob/2.02-coreos/grub-core/disk/efi/efidisk.c#L567-L569

  status =  efi_call_5 ((wr ? bio->write_blocks : bio->read_blocks), bio,
			bio->media->media_id, (grub_efi_uint64_t) sector,
			(grub_efi_uintn_t) num_bytes, aligned_buf);

Could it be that it is reading from disk and writing the result into aligned_buf, but overflowing that buffer? It writes num_bytes bytes into the buffer:

  num_bytes = size << disk->log_sector_size;

aligned_buf can either be allocated via aligned_buf = grub_memalign (io_align, num_bytes); or passed in by the caller. I don't know if this is going in the right direction...

ajeddeloh commented Apr 9, 2018

@alban is that on an EFI system?

lucab commented Apr 9, 2018

Seemingly related coreos/grub#53

ajeddeloh commented Apr 9, 2018

I manually went in and fixed the dest (both in gdb and by hex-editing linux.mod) to be correct (what the grub PR is doing) and in both cases that fixed the free magic issue. Additionally, changing it back to the old value caused it to manifest again, so I'm pretty confident that is the issue.

alban commented Apr 10, 2018

@ajeddeloh I don't know if it is EFI or not (I just used an equivalent of coreos_production_qemu.sh). But I also think @lucab's patch is the right fix 👍

ajeddeloh commented Apr 10, 2018

Closed via coreos/coreos-overlay#3166. This will be in the next alpha (released this week). Unfortunately since we cannot update grub, reprovisioning will be necessary to pick up this change.

bgilbert commented Apr 10, 2018

While reprovisioning is necessary to pick up the code change involved here, we haven't observed a case of an update triggering this error, so we currently think that users will not need to reprovision.

If you do encounter the free magic message, re-provisioning should fix it, and please do let us know!

We'll continue to investigate the full impact in #2400.

bgilbert added a commit to bgilbert/scripts that referenced this issue Apr 13, 2018

Revert "Revert "disk_util: use FAT32 on ESP""
FAT32 seemed to aggravate coreos/bugs#2284, but
now that that issue has been addressed, we can return to the correct
filesystem type.

This reverts commit 299f8fb.