[2765.2.4] 5.10.37 kernel oops in __domain_mapping+0xa7/0x3a0 #400

Closed
g-edwards opened this issue May 20, 2021 · 4 comments

g-edwards commented May 20, 2021

Description

The 5.10.37-flatcar kernel (2765.2.4 stable release) panics on every boot on a Dell server with an Intel Xeon E5-2640 v3 processor.

[ 5.316760] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 5.317108] #PF: supervisor read access in kernel mode
[ 5.317108] #PF: error_code(0x0000) - not-present page
[ 5.317108] PGD 0 P4D 0
[ 5.317108] Oops: 0000 [#1] SMP PTI
[ 5.317108] CPU: 8 PID: 1 Comm: swapper/0 Not tainted 5.10.37-flatcar #1
[ 5.317108] Hardware name: /086D43, BIOS 2.7.1 001/22/2018
[ 5.317108] RIP: 0010:__domain_mapping+0xa7/0x3a0
[ 5.317108] Code: 0a 00 00 02 0f 85 17 02 00 00 4c 89 d3 45 31 e4 31 ed 45 31 c9 48 c1 e3 0c 4c 09 fb 4d 85 c0 0f 84 26 01 00 00 4d 85 e4 75 5d <41> 8b 55 08 41 8b 45 0c 49 8b 5d 00 89 d1 48 89 c6 81 e1 ff 0f 00
[ 5.317108] RSP: 0000:ffffa3bf0004fbd0 EFLAGS: 00010246
[ 5.317108] RAX: 0000000000000000 RBX: 000000007ae07003 RCX: 000000000000002d
[ 5.317108] RDX: 0000000000000000 RSI: 000000000007ae07 RDI: ffff90c50a11c200
[ 5.317108] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[ 5.317108] R10: 000000000007ae07 R11: ffff90c50a11c200 R12: 0000000000000000
[ 5.317108] R13: 0000000000000000 R14: 000000000007ae07 R15: 0000000000000003
[ 5.317108] FS: 0000000000000000(0000) GS:ffff90d43fc00000(0000) knlGS:0000000000000000
[ 5.317108] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5.317108] CR2: 0000000000000008 CR3: 00000005e960a001 CR4: 00000000001706e0
[ 5.317108] Call Trace:
[ 5.317108] ? set_next_entity+0xa3/0x1e0
[ 5.317108] domain_mapping+0x1b/0xa0
[ 5.317108] __iommu_map+0xfd/0x1f0
[ 5.317108] _iommu_map+0x15/0x40
[ 5.317108] iommu_create_device_direct_mappings.isra.0+0x179/0x1e0
[ 5.317108] bus_iommu_probe+0x165/0x280
[ 5.317108] bus_set_iommu+0x88/0xe0
[ 5.317108] intel_iommu_init+0xef0/0x10e8
[ 5.317108] ? mntput_no_expire+0x47/0x240
[ 5.317108] ? e820__memblock_setup+0x7d/0x7d
[ 5.317108] pci_iommu_init+0x16/0x3f
[ 5.317108] do_one_initcall+0x44/0x1d0
[ 5.317108] kernel_init_freeable+0x1c9/0x20e
[ 5.317108] ? rest_init+0xb4/0xb4
[ 5.317108] kernel_init+0xa/0x10c
[ 5.317108] ret_from_fork+0x22/0x30
[ 5.317108] Modules linked in:
[ 5.317108] CR2: 0000000000000008
[ 5.317108] ---[ end trace 7bb64b28ae710c73 ]---
[ 5.317108] RIP: 0010:__domain_mapping+0xa7/0x3a0
[ 5.317108] Code: 0a 00 00 02 0f 85 17 02 00 00 4c 89 d3 45 31 e4 31 ed 45 31 c9 48 c1 e3 0c 4c 09 fb 4d 85 c0 0f 84 26 01 00 00 4d 85 e4 75 5d <41> 8b 55 08 41 8b 45 0c 49 8b 5d 00 89 d1 48 89 c6 81 e1 ff 0f 00
[ 5.317108] RSP: 0000:ffffa3bf0004fbd0 EFLAGS: 00010246
[ 5.317108] RAX: 0000000000000000 RBX: 000000007ae07003 RCX: 000000000000002d
[ 5.317108] RDX: 0000000000000000 RSI: 000000000007ae07 RDI: ffff90c50a11c200
[ 5.317108] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[ 5.317108] R10: 000000000007ae07 R11: ffff90c50a11c200 R12: 0000000000000000
[ 5.317108] R13: 0000000000000000 R14: 000000000007ae07 R15: 0000000000000003
[ 5.317108] FS: 0000000000000000(0000) GS:ffff90d43fc00000(0000) knlGS:0000000000000000
[ 5.317108] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5.317108] CR2: 0000000000000008 CR3: 00000005e960a001 CR4: 00000000001706e0
[ 5.317108] Kernel panic - not syncing: Fatal exception

After the panic, the system boots back up on the previous 2765.2.3 stable release and tries again.

Others appear to be seeing the same issue with 5.10.37, and it looks like a patch was not backported correctly:

https://lore.kernel.org/lkml/20210515132855.4bn7ve2ozvdhpnj4@nabokov.fritz.box/

Impact

2765.2.4 stable release is not usable on this hardware.

Environment and steps to reproduce

  1. Set-up: Flatcar Linux stable on a Dell Haswell server
  2. Task: automatic update to 2765.2.4 stable release
  3. Action(s):
    a. update-engine pulls 2765.2.4 stable release
    b. reboot
  4. Error: 5.10.37-flatcar kernel panics on boot

Expected behavior

Boot succeeds.

Additional information

Member

pothos commented May 20, 2021

Thanks for reporting. The rollout of the 2765.2.4 Stable update has been stopped on the update server, which announces 2765.2.3 as the latest version for the time being. A new kernel bug-fix release has already been published, and we will include the fix in a follow-up release soon.

Member

pothos commented May 20, 2021

In case your instances have downloaded the new version already but not yet rebooted, you can discard the pending update as follows:

$ update_engine_client -reset_status 
$ sudo cgpt add -T 0 $(sudo cgpt find -t flatcar-usr 2>/dev/null | grep --invert-match "$(rootdev -s /usr)")

The second command will make sure that even when the system reboots, it will not attempt to use the 2765.2.4 partition because there are zero boot "tries" for it.
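
As a minimal sketch of the selection step inside that second command: the helper below (`select_inactive_usr` is a hypothetical name, not part of Flatcar) prints every candidate partition except the active one. It uses exact whole-line matching instead of the substring match above, on the assumption that a device name such as /dev/sda3 should not also exclude /dev/sda30. The device names in the example are made up.

```shell
#!/bin/sh
# Sketch only: print every candidate /usr partition except the active one.
# Candidates are read from stdin, one device path per line.
# select_inactive_usr is a hypothetical helper name used for illustration.
select_inactive_usr() {
    active_dev="$1"
    # -v inverts the match, -x requires a whole-line match, -F disables regex,
    # so /dev/sda3 does not accidentally filter out /dev/sda30.
    grep -vxF -- "$active_dev"
}

# Example with made-up device names (no root, cgpt, or rootdev needed):
printf '%s\n' /dev/sda3 /dev/sda4 | select_inactive_usr /dev/sda3
```

On a real machine the candidate list would come from `sudo cgpt find -t flatcar-usr` and the argument from `rootdev -s /usr`, as in the commands above.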

Member

pothos commented May 21, 2021

The Stable 2765.2.5 update is being published now.

pothos closed this as completed May 21, 2021
@g-edwards
Author

2765.2.5 works great. Thank you!
