Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2247.3.0 "wrong slab cache" lockups, may be related to cephfs #2616

Closed
nealey opened this issue Sep 26, 2019 · 9 comments

Comments

@nealey
Copy link

commented Sep 26, 2019

Issue Report

Bug

Container Linux Version

2247.3.0 beta channel

Environment

Bare metal, Dell PowerEdge r720

Expected Behavior

Machine should keep running longer than 5 minutes after booting 2247.3.0 with cephfs mounts

Actual Behavior

Machine starts logging many times per second:

kernel: cache_from_obj: Wrong slab cache. inode_cache but object is from ceph_inode_info

With an occasional oops message. Eventually it reboots, possibly due to watchdog.

This may be related to cephfs mounts. I was able to downgrade to stable with this kernel after stopping the docker service and killing all cephfs mounts.

Reproduction Steps

  1. Boot this machine
  2. Wait

I'm sorry I don't have more information here. This machine needed to get back into service ASAP. Hopefully this bug report can be built up by other Container Linux users.

Other Information

@ajeddeloh

This comment has been minimized.

Copy link

commented Sep 26, 2019

Is this behavior new with 2247.3.0? Do other versions work? Does alpha work?

@nealey

This comment has been minimized.

Copy link
Author

commented Sep 27, 2019

It just began today, it appears to be new with 2247.3.0 because when we reverted to the other partition there was no problem.

I'll make an attempt to reproduce this on a less-critical machine and get back to you. The machine with the problem should never have been on beta.

@nealey

This comment has been minimized.

Copy link
Author

commented Sep 27, 2019

I should mention, after reverting this machine to the stable channel, everything is fine. So it's something to do with the 2247.3.0 release.

@bgilbert

This comment has been minimized.

Copy link
Member

commented Sep 27, 2019

Could you post the oops message? The backtrace would be helpful.

@joberget

This comment has been minimized.

Copy link

commented Sep 27, 2019

H I am experiencing the same issue here with CoreOS 2247.3.0 on a vmware host. The node crashes very often with reboots. It seems like it starts spamming Ceph with lots of requests. Reverting to stable and everything works fine.

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=2247.3.0
VERSION_ID=2247.3.0
BUILD_ID=2019-09-24-0005
PRETTY_NAME="Container Linux by CoreOS 2247.3.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
[  101.915699] cache_from_obj: Wrong slab cache. inode_cache but object is from ceph_inode_info
[  101.916019] WARNING: CPU: 6 PID: 3581 at mm/slab.h:380 kmem_cache_free+0x139/0x1d0
[  101.916290] Modules linked in: nbd xt_statistic xt_physdev xt_nat xt_recent ipt_REJECT nf_reject_ipv4 cbc ceph libceph fscache sch_htb ebt_ip ebtable_filter ebtables veth xt_mark xt_comment ipt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc overlay vmw_vsock_vmci_transport vsock nls_ascii nls_cp437 vfat fat mousedev vmw_balloon sb_edac psmouse evdev vmw_vmci edac_core i2c_piix4 i2c_core button sch_fq_codel ext4 crc32c_generic crc16 mbcache jbd2 fscrypto dm_verity dm_bufio sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper ata_piix vmxnet3 vmw_pvscsi libata scsi_mod dm_mirror dm_region_hash dm_log
[  101.918684]  dm_mod
[  101.918790] CPU: 6 PID: 3581 Comm: systemd-machine Not tainted 4.19.75-coreos #1
[  101.919076] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/03/2018
[  101.919474] RIP: 0010:kmem_cache_free+0x139/0x1d0
[  101.919661] Code: 0f 84 09 ff ff ff 48 3b a8 d0 00 00 00 74 6a 48 8b 48 58 48 8b 55 58 48 c7 c6 c0 a2 c4 b8 48 c7 c7 38 36 0b b9 e8 95 53 eb ff <0f> 0b e9 de fe ff ff 65 8b 05 e9 c6 de 47 89 c0 48 0f a3 05 ef 4a
[  101.920339] RSP: 0018:ffff925d1f703ed8 EFLAGS: 00010246
[  101.920543] RAX: 0000000000000050 RBX: ffff925ce6155cc0 RCX: 0000000000000000
[  101.920809] RDX: 0000000000000000 RSI: ffff925d1f7165f8 RDI: ffff925d1f7165f8
[  101.921078] RBP: ffff92551f162000 R08: 00000000000005f3 R09: 203a6a626f5f6d6f
[  101.921345] R10: 000000000000076f R11: 203a6a626f5f6d6f R12: ffff925d1f722a78
[  101.921611] R13: ffffffffb9265ec0 R14: ffff925ce6155df8 R15: 000000000000000a
[  101.921878] FS:  00007f16a9088940(0000) GS:ffff925d1f700000(0000) knlGS:0000000000000000
[  101.922188] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  101.922408] CR2: 0000561300dfbd00 CR3: 00000008155f2006 CR4: 00000000003606e0
[  101.922705] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  101.922972] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  101.923283] Call Trace:
[  101.923396]  <IRQ>
[  101.923496]  ? free_inode_nonrcu+0x20/0x20
[  101.923667]  rcu_process_callbacks+0x298/0x7c0
[  101.923847]  ? hrtimer_wakeup+0x1e/0x30
[  101.924008]  __do_softirq+0xe3/0x2eb
[  101.924160]  irq_exit+0xc8/0xd0
[  101.924296]  smp_apic_timer_interrupt+0x74/0x140
[  101.924480]  apic_timer_interrupt+0xf/0x20
[  101.924647]  </IRQ>
[  101.924749] RIP: 0010:___bpf_prog_run+0x6c4/0xf20
[  101.924936] Code: 7d f9 ff ff 0f b6 43 01 48 83 c3 08 89 c2 83 e0 0f c0 ea 04 0f b6 d2 48 8b 54 d5 00 48 89 54 c5 00 e9 5b f9 ff ff 0f b6 43 01 <8b> 7b 04 48 83 c3 08 83 e0 0f 48 89 7c c5 00 e9 43 f9 ff ff 0f b6
[  101.925614] RSP: 0018:ffff9f1387287ce0 EFLAGS: 00000287 ORIG_RAX: ffffffffffffff13
[  101.925907] RAX: 0000000000000000 RBX: ffff9f1386df3158 RCX: 0000000000000000
[  101.926176] RDX: 0000000040000000 RSI: 00000000000000b4 RDI: 00000000c000003e
[  101.926443] RBP: ffff9f1387287d28 R08: 0000000000000000 R09: 0000000000000000
[  101.926709] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  101.926976] R13: ffffffffb8c43b80 R14: 0000000000000000 R15: 0000000000000000
[  101.927245]  ? __bpf_prog_run32+0x39/0x60
[  101.927410]  ? __switch_to+0x3da/0x440
[  101.927564]  ? __switch_to_asm+0x41/0x70
[  101.927724]  ? __switch_to_asm+0x35/0x70
[  101.927887]  ? ep_item_poll.isra.20+0x40/0xc0
[  101.928063]  ? seccomp_run_filters+0x5c/0xb0
[  101.928235]  ? ep_send_events_proc+0x7b/0x1a0
[  101.928410]  ? ep_read_events_proc+0xd0/0xd0
[  101.928582]  ? __seccomp_filter+0x44/0x4a0
[  101.928748]  ? ep_poll+0x1e4/0x3a0
[  101.928893]  ? syscall_trace_enter+0x192/0x2b0
[  101.929073]  ? do_syscall_64+0xc7/0x100
[  101.929231]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  101.929434] ---[ end trace 981df86f1270b951 ]---```
@kawaja

This comment has been minimized.

Copy link

commented Sep 27, 2019

Also having same issue with CoreOS 2247.3.0 running under KVM 2.12.0. Reverted to 2247.2.0 and issue does not occur.

Unlike @nealey though, I only got the Wrong slab cache log once, not multiple times per second. My VM would also reboot shortly after logging this:

Sep 27 18:06:57 k06 kernel: cache_from_obj: Wrong slab cache. inode_cache but object is from ceph_inode_info
Sep 27 18:06:57 k06 kernel: WARNING: CPU: 1 PID: 0 at mm/slab.h:380 kmem_cache_free+0x139/0x1d0
Sep 27 18:06:57 k06 kernel: Modules linked in: iptable_mangle dummy xt_set ip_set_hash_ipport xt_ipvs ip_set_hash_ip ip_set_hash_net ip_set cbc ceph fscache rbd libceph xt_statistic xt_physdev xt_nat xt_recent ipt_REJECT nf_reject_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs veth xt_comment xt_mark ipt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc overlay nls_ascii nls_cp437 vfat fat mousedev evdev psmouse virtio_balloon i2c_piix4 i2c_core button sch_fq_codel ext4 crc32c_generic crc16 mbcache jbd2 fscrypto dm_verity dm_bufio uhci_hcd ehci_pci ehci_hcd ata_piix libata usbcore virtio_net scsi_mod net_failover virtio_console failover virtio_blk usb_common
Sep 27 18:06:57 k06 kernel:  qemu_fw_cfg dm_mirror dm_region_hash dm_log dm_mod
Sep 27 18:06:57 k06 kernel: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.75-coreos #1
Sep 27 18:06:57 k06 kernel: Hardware name: Red Hat KVM, BIOS 1.11.0-2.el7 04/01/2014
Sep 27 18:06:57 k06 kernel: RIP: 0010:kmem_cache_free+0x139/0x1d0
Sep 27 18:06:57 k06 kernel: Code: 0f 84 09 ff ff ff 48 3b a8 d0 00 00 00 74 6a 48 8b 48 58 48 8b 55 58 48 c7 c6 c0 a2 c4 b4 48 c7 c7 38 36 0b b5 e8 95 53 eb ff <0f> 0b e9 de fe ff ff 65 8b 05 e9 c6 de 4b 89 c0 48 0f a3 05 ef 4a
Sep 27 18:06:57 k06 kernel: RSP: 0018:ffff947473b03ed8 EFLAGS: 00010246
Sep 27 18:06:57 k06 kernel: RAX: 0000000000000050 RBX: ffff9473dd763b30 RCX: 0000000000000000
Sep 27 18:06:57 k06 kernel: RDX: 0000000000000000 RSI: ffff947473b165f8 RDI: ffff947473b165f8
Sep 27 18:06:57 k06 kernel: RBP: ffff947473561500 R08: 000000000000026f R09: 0000000000000007
Sep 27 18:06:57 k06 kernel: R10: 000000000000076f R11: ffffffffb7ea426d R12: ffff947473b22a78
Sep 27 18:06:57 k06 kernel: R13: ffffffffb5265ec0 R14: ffff9473dd763c68 R15: 000000000000000a
Sep 27 18:06:57 k06 kernel: FS:  0000000000000000(0000) GS:ffff947473b00000(0000) knlGS:0000000000000000
Sep 27 18:06:57 k06 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 27 18:06:57 k06 kernel: CR2: 00007f725fad96e0 CR3: 0000000329760000 CR4: 00000000000006e0
Sep 27 18:06:57 k06 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 27 18:06:57 k06 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep 27 18:06:57 k06 kernel: Call Trace:
Sep 27 18:06:57 k06 kernel:  <IRQ>
Sep 27 18:06:57 k06 kernel:  ? free_inode_nonrcu+0x20/0x20
Sep 27 18:06:57 k06 kernel:  rcu_process_callbacks+0x298/0x7c0
Sep 27 18:06:57 k06 kernel:  ? rebalance_domains+0x274/0x2c0
Sep 27 18:06:57 k06 kernel:  __do_softirq+0xe3/0x2eb
Sep 27 18:06:57 k06 kernel:  irq_exit+0xc8/0xd0
Sep 27 18:06:57 k06 kernel:  smp_apic_timer_interrupt+0x74/0x140
Sep 27 18:06:57 k06 kernel:  apic_timer_interrupt+0xf/0x20
Sep 27 18:06:57 k06 kernel:  </IRQ>
Sep 27 18:06:57 k06 kernel: RIP: 0010:native_safe_halt+0xe/0x10
Sep 27 18:06:57 k06 kernel: Code: ff ff 7f c3 65 48 8b 04 25 40 5c 01 00 f0 80 48 02 20 48 8b 00 a8 08 75 c4 eb 80 90 e9 07 00 00 00 0f 00 2d 56 e9 57 00 fb f4 <c3> 90 e9 07 00 00 00 0f 00 2d 46 e9 57 00 f4 c3 90 90 0f 1f 44 00
Sep 27 18:06:57 k06 kernel: RSP: 0018:ffffa72c01903ea8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Sep 27 18:06:57 k06 kernel: RAX: ffffffffb468a9a0 RBX: 0000000000000001 RCX: ffffffffb5261ac0
Sep 27 18:06:57 k06 kernel: RDX: ffffffffb52592b8 RSI: 0000000000000000 RDI: 00000012f6b29952
Sep 27 18:06:57 k06 kernel: RBP: 0000000000000001 R08: 0000001f2db91cc0 R09: ffff9474690d4000
Sep 27 18:06:57 k06 kernel: R10: 00000000826b7e00 R11: 0000000000000001 R12: 0000000000000000
Sep 27 18:06:57 k06 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Sep 27 18:06:57 k06 kernel:  ? __sched_text_end+0x1/0x1
Sep 27 18:06:57 k06 kernel:  default_idle+0x1c/0x140
Sep 27 18:06:57 k06 kernel:  do_idle+0x1e1/0x260
Sep 27 18:06:57 k06 kernel:  cpu_startup_entry+0x6f/0x80
Sep 27 18:06:57 k06 kernel:  start_secondary+0x1a4/0x1f0
Sep 27 18:06:57 k06 kernel:  secondary_startup_64+0xa4/0xb0
Sep 27 18:06:57 k06 kernel: ---[ end trace 4246532d72e37ef0 ]---
Sep 27 18:06:57 k06 kernel: WARNING: CPU: 1 PID: 0 at kernel/rcu/tree.c:2622 rcu_process_callbacks+0x7ac/0x7c0
Sep 27 18:06:57 k06 kernel: Modules linked in: iptable_mangle dummy xt_set ip_set_hash_ipport xt_ipvs ip_set_hash_ip ip_set_hash_net ip_set cbc ceph fscache rbd libceph xt_statistic xt_physdev xt_nat xt_recent ipt_REJECT nf_reject_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs veth xt_comment xt_mark ipt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc overlay nls_ascii nls_cp437 vfat fat mousedev evdev psmouse virtio_balloon i2c_piix4 i2c_core button sch_fq_codel ext4 crc32c_generic crc16 mbcache jbd2 fscrypto dm_verity dm_bufio uhci_hcd ehci_pci ehci_hcd ata_piix libata usbcore virtio_net scsi_mod net_failover virtio_console failover virtio_blk usb_common
Sep 27 18:06:57 k06 kernel:  qemu_fw_cfg dm_mirror dm_region_hash dm_log dm_mod
Sep 27 18:06:57 k06 kernel: CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W         4.19.75-coreos #1
Sep 27 18:06:57 k06 kernel: Hardware name: Red Hat KVM, BIOS 1.11.0-2.el7 04/01/2014
Sep 27 18:06:57 k06 kernel: RIP: 0010:rcu_process_callbacks+0x7ac/0x7c0
Sep 27 18:06:57 k06 kernel: Code: 89 ff 88 04 24 e8 74 c3 ff ff 0f b6 04 24 84 c0 0f 84 5b f9 ff ff 48 89 ef e8 00 c3 ff ff e9 4e f9 ff ff 0f 0b e9 ae f8 ff ff <0f> 0b e9 c0 fc ff ff 0f 0b e9 0f fa ff ff e8 61 f2 f8 ff 90 0f 1f
Sep 27 18:06:57 k06 kernel: RSP: 0018:ffff947473b03ef8 EFLAGS: 00010002
Sep 27 18:06:57 k06 kernel: RAX: 0000000000000000 RBX: ffff947473b22a40 RCX: 7fffffffffffffff
Sep 27 18:06:57 k06 kernel: RDX: ffffffffffffd801 RSI: ffff947473b03f18 RDI: ffff947473b22a78
Sep 27 18:06:57 k06 kernel: RBP: ffffffffb5261ac0 R08: 0000000000000000 R09: ffffffffb40ec301
Sep 27 18:06:57 k06 kernel: R10: ffff9473d0ed1240 R11: 0000000000000001 R12: ffff947473b22a78
Sep 27 18:06:57 k06 kernel: R13: ffffffffb5265ec0 R14: 0000000000000000 R15: 000000000000000a
Sep 27 18:06:57 k06 kernel: FS:  0000000000000000(0000) GS:ffff947473b00000(0000) knlGS:0000000000000000
Sep 27 18:06:57 k06 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 27 18:06:57 k06 kernel: CR2: 00007f725fad96e0 CR3: 0000000329760000 CR4: 00000000000006e0
Sep 27 18:06:57 k06 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 27 18:06:57 k06 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep 27 18:06:57 k06 kernel: Call Trace:
Sep 27 18:06:57 k06 kernel:  <IRQ>
Sep 27 18:06:57 k06 kernel:  ? rebalance_domains+0x274/0x2c0
Sep 27 18:06:57 k06 kernel:  __do_softirq+0xe3/0x2eb
Sep 27 18:06:57 k06 kernel:  irq_exit+0xc8/0xd0
Sep 27 18:06:57 k06 kernel:  smp_apic_timer_interrupt+0x74/0x140
Sep 27 18:06:57 k06 kernel:  apic_timer_interrupt+0xf/0x20
Sep 27 18:06:57 k06 kernel:  </IRQ>
Sep 27 18:06:57 k06 kernel: RIP: 0010:native_safe_halt+0xe/0x10
Sep 27 18:06:57 k06 kernel: Code: ff ff 7f c3 65 48 8b 04 25 40 5c 01 00 f0 80 48 02 20 48 8b 00 a8 08 75 c4 eb 80 90 e9 07 00 00 00 0f 00 2d 56 e9 57 00 fb f4 <c3> 90 e9 07 00 00 00 0f 00 2d 46 e9 57 00 f4 c3 90 90 0f 1f 44 00
Sep 27 18:06:57 k06 kernel: RSP: 0018:ffffa72c01903ea8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Sep 27 18:06:57 k06 kernel: RAX: ffffffffb468a9a0 RBX: 0000000000000001 RCX: ffffffffb5261ac0
Sep 27 18:06:57 k06 kernel: RDX: ffffffffb52592b8 RSI: 0000000000000000 RDI: 00000012f6b29952
Sep 27 18:06:57 k06 kernel: RBP: 0000000000000001 R08: 0000001f2db91cc0 R09: ffff9474690d4000
Sep 27 18:06:57 k06 kernel: R10: 00000000826b7e00 R11: 0000000000000001 R12: 0000000000000000
Sep 27 18:06:57 k06 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Sep 27 18:06:57 k06 kernel:  ? __sched_text_end+0x1/0x1
Sep 27 18:06:57 k06 kernel:  default_idle+0x1c/0x140
Sep 27 18:06:57 k06 kernel:  do_idle+0x1e1/0x260
Sep 27 18:06:57 k06 kernel:  cpu_startup_entry+0x6f/0x80
Sep 27 18:06:57 k06 kernel:  start_secondary+0x1a4/0x1f0
Sep 27 18:06:57 k06 kernel:  secondary_startup_64+0xa4/0xb0
Sep 27 18:06:57 k06 kernel: ---[ end trace 4246532d72e37ef1 ]---
@bgilbert

This comment has been minimized.

Copy link
Member

commented Oct 8, 2019

Might be fixed by coreos/linux@72f0fff and coreos/linux@e9bcaf8.

Anyone who has experienced this problem: could you try out this build and report whether you still see the issue?

@poelzi

This comment has been minimized.

Copy link

commented Oct 9, 2019

I have this on NixOS 19.03 with Kernel 4.19.76 as well. Worked with 18.09 just fine.

@bgilbert

This comment has been minimized.

Copy link
Member

commented Oct 9, 2019

This should be fixed in Container Linux alpha 2275.1.0 and beta 2247.4.0, rolling out now. Thanks for the report. Please reopen if you're still seeing this problem on the new releases.

@bgilbert bgilbert closed this Oct 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.