
azure: refcount_t overflow at xfrm_policy_lookup_bytype #2495

Open
chpaadmin opened this Issue Aug 16, 2018 · 3 comments


chpaadmin commented Aug 16, 2018

Issue Report

I observed a kernel crash on 4.14.32-coreos running on an Azure VM.
The crash repeated until reboot, and the VM became unavailable.

Bug

Aug 07 12:50:06 dockerhost31_cluster10 kernel: ------------[ cut here ]------------
 Aug 07 12:50:06 dockerhost31_cluster10 kernel: Kernel BUG at ffffffff855f5745 [verbose debug info unavailable]
 Aug 07 12:50:06 dockerhost31_cluster10 kernel: refcount_t overflow at xfrm_policy_lookup_bytype+0x269/0x290 in 4_scheduler[87392], uid/euid: 100/100
 Aug 07 12:50:08 dockerhost31_cluster10 fleetd[883]: ERROR engine.go:217: Engine leadership lost, renewal failed: context deadline exceeded
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: ------------[ cut here ]------------
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: Kernel BUG at ffffffff855f5745 [verbose debug info unavailable]
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: refcount_t overflow at xfrm_policy_lookup_bytype+0x269/0x290 in java[99615], uid/euid: 0/0
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: ------------[ cut here ]------------
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: WARNING: CPU: 2 PID: 99615 at ../source/kernel/panic.c:612 refcount_error_report+0x96/0xa0
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: Modules linked in: loop cmac arc4 ecb md4 nls_utf8 cifs ccm fscache seqiv iptable_raw vxlan ip6_udp_tunnel udp_tunnel xt_nat xt_mark xfrm6_mode_tunnel xfrm4_mode_tunnel esp4 drbg veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype xt_conntrack nf_nat br_netfilter bridge stp llc overlay iptable_filter nls_ascii nls_cp437 vfat fat sb_edac edac_core psmouse i2c_piix4 i2c_core hv_utils hv_balloon cn evdev ptp hyperv_fb button mousedev pps_core sch_fq_codel nf_conntrack libcrc32c crc32c_generic ext4 crc16 mbcache jbd2 fscrypto dm_verity dm_bufio sd_mod hid_generic sr_mod cdrom crc32c_intel hid_hyperv hv_netvsc hyperv_keyboard hv_storvsc hid scsi_transport_fc ata_piix libata aesni_intel
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: aes_x86_64 crypto_simd cryptd glue_helper scsi_mod hv_vmbus dm_mirror dm_region_hash dm_log dm_mod dax
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: CPU: 2 PID: 99615 Comm: java Not tainted 4.14.19-coreos #1
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: task: ffff8fedb1a75ac0 task.stack: ffffa6ddc750c000
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: RIP: 0010:refcount_error_report+0x96/0xa0
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: RSP: 0018:ffff8fef0f683a30 EFLAGS: 00010286
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: RAX: 000000000000005a RBX: ffffffff85df7095 RCX: 0000000000000001
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000286
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: RBP: ffff8fef0f683b78 R08: 0000000000000000 R09: 000000000000005a
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: R10: 00000000ffffffff R11: 0000000000000000 R12: ffff8fedb1a75ac0
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: R13: 0000000000000000 R14: 0000000000000006 R15: ffffffff85dda879
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: FS: 00007fdfa2efa700(0000) GS:ffff8fef0f680000(0000) knlGS:0000000000000000
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: CR2: 00007fb2bc41e0c8 CR3: 00000001c688c000 CR4: 00000000001406e0
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: Call Trace:
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: 
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: ex_handler_refcount+0x4e/0x80
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: fixup_exception+0x32/0x40
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: do_trap+0x105/0x150
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: do_error_trap.part.12+0x86/0x100
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: ? csum_partial_copy_generic+0xf25/0x1720
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: ? report_bug+0xa9/0xf0
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: ? fixup_bug.part.11+0x18/0x30
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: invalid_op+0x22/0x40
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: RIP: 0010:xfrm_policy_lookup_bytype+0x269/0x290
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: RSP: 0018:ffff8fef0f683c28 EFLAGS: 00010a12
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: RAX: 0000000000000000 RBX: ffff8fed5e0de000 RCX: ffff8fed5e0de030
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: RDX: 00000000c0000000 RSI: ffff8fed5e0de030 RDI: 0000000000000000
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: RBP: 0000000000000001 R08: 0000000000000020 R09: 0000000087d19d97
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: R10: 00000000ffffffff R11: 000000004da48913 R12: ffff8fef0f683d08
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: R13: 000000000000001e R14: 00000000ffffffff R15: 0000000000000002
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: ? xfrm_policy_lookup_bytype+0x1f4/0x290
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: xfrm_lookup+0x2dd/0x8f0
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: __xfrm_route_forward+0x61/0x100
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: ip_forward+0x39e/0x470
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: ? ip_rcv_finish+0xa5/0x3f0
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: ip_rcv+0x27e/0x3a0
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: ? inet_del_offload+0x40/0x40
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: __netif_receive_skb_core+0x332/0xb60
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: ? process_backlog+0x92/0x140
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: process_backlog+0x92/0x140
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: net_rx_action+0x261/0x3a0
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: __do_softirq+0xf7/0x285
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: do_softirq_own_stack+0x2a/0x40
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: 
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: do_softirq.part.15+0x3d/0x50
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: __local_bh_enable_ip+0x55/0x60
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: ip_finish_output2+0x19a/0x390
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: ip_output+0x71/0xe0
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: ? ip_fragment.constprop.47+0x80/0x80
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: tcp_transmit_skb+0x524/0x9a0
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: tcp_write_xmit+0x1e7/0xfb0
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: __tcp_push_pending_frames+0x2d/0xd0
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: tcp_sendmsg_locked+0x5ae/0xe10
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: tcp_sendmsg+0x27/0x40
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: sock_sendmsg+0x30/0x40
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: SYSC_sendto+0xd7/0x150
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: ? __dentry_kill+0xde/0x160
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: ? __audit_syscall_entry+0xb2/0x100
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: ? syscall_trace_enter+0x1c7/0x2c0
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: do_syscall_64+0x66/0x1d0
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: entry_SYSCALL_64_after_hwframe+0x21/0x86
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: RIP: 0033:0x7fe040cfa086
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: RSP: 002b:00007fdfa2ee8de0 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: RAX: ffffffffffffffda RBX: 0000000000000020 RCX: 00007fe040cfa086
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: RDX: 000000000000037b RSI: 00007fdfa2ee8ee0 RDI: 0000000000000020
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007fdfa2ee8ee0
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: R13: 000000000000037b R14: 0000000000000000 R15: 0000000000000020
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: Code: 95 80 00 00 00 41 55 49 8d 8c 24 d8 06 00 00 45 8b 84 24 30 05 00 00 41 89 c1 48 89 de 48 c7 c7 68 a0 de 85 31 c0 e8 c7 6c 05 00 <0f> ff 58 5b 5d 41 5c 41 5d c3 0f 1f 44 00 00 55 48 89 e5 41 56
 Aug 07 12:50:10 dockerhost31_cluster10 kernel: ---[ end trace fa6e7ecf7f7e8dfe ]---

Container Linux Version

kernel 4.14.32

$ cat /etc/os-release
NAME="Container Linux by CoreOS"
...
BUG_REPORT_URL="https://issues.coreos.com"

Environment

CoreOS running on Azure VM.

Reproduction Steps

I don't know how to reproduce it.

Member

lucab commented Aug 16, 2018

Thanks for reporting this. Based on your kernel version, it looks like you may be running an older stable release (likely 1688.5.3). Can you confirm which OS version you are on, and whether you hit the same behavior on the latest stable and alpha?

Also, are you using any custom network configuration/tweaks? The trace seems to hint at a resource leak in a refcounted network object.

lucab changed the title from "CoreOS 4.14.32 crashed at xfrm_policy_lookup_bytype" to "azure: refcount_t overflow at xfrm_policy_lookup_bytype" Aug 16, 2018

chpaadmin commented Aug 17, 2018

Hi Luca,

Thanks for looking into this; the OS version is below:

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1688.5.3
VERSION_ID=1688.5.3
BUILD_ID=2018-04-03-0547
PRETTY_NAME="Container Linux by CoreOS 1688.5.3 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

The VM is running in Azure with no special network configuration as far as I know. You mentioned there may be a resource leak; is there any further information we could check?

Member

lucab commented Aug 17, 2018

The os-release reported here (1688.5.3, shipping with kernel 4.14.32) does not match the kernel version in the panic log (Not tainted 4.14.19-coreos). As both are quite old kernels/releases, before spending more time on this I suggest checking the latest stable/alpha release to see whether the same overflow happens there.

The resource leak seems to be related to ip-xfrm policies (i.e. likely some tunneled/encrypted interface). It may be possible to observe its effects via sudo ip xfrm policy list.
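A minimal monitoring sketch along those lines (illustrative only, not an official tool: the count_xfrm_policies helper name and the "^src " line-count heuristic are assumptions; it assumes iproute2's ip is installed):

```shell
#!/bin/sh
# "ip xfrm policy list" prints one "src <selector>" line per installed
# policy, so counting those lines approximates the number of live policies.
count_xfrm_policies() {
    ip xfrm policy list 2>/dev/null | grep -c '^src '
}

# Take one timestamped sample; run this periodically (e.g. via watch or a
# cron entry). A count that grows steadily without matching deletions would
# support the resource-leak hypothesis.
printf '%s policies=%s\n' "$(date -u +%FT%TZ)" "$(count_xfrm_policies)"
```

Correlating a growing count with whatever creates tunnels on the host (VPN daemons, overlay networking, IPsec config) could help narrow down the leaking path.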
