Kernel panic with vxlan in openvswitch (via openshift) #2382

Closed
squeed opened this Issue Mar 14, 2018 · 22 comments

squeed commented Mar 14, 2018

Issue Report

The OpenShift openvswitch-based pod network causes a kernel panic as soon as a pod receives a packet from an external node.

Bug

Container Linux Version

1688.3.0

Environment

libvirt+qemu

Reproduction Steps

This is a bit complicated. I have a hybrid OpenShift cluster on QEMU, where some workers are CentOS and some are Container Linux (don't judge).

I've got a script and some bootstrapping instructions here: https://github.com/squeed/os-on-cl

Once you have a cluster running:

  1. Kill all the other workers, so you get scheduled where you want.
  2. Run a pod: kubectl run --rm -ri --image alpine test /bin/sh
  3. Get the pod's IP: ip addr
  4. On another node in the cluster, ping that IP. The node should kernel panic instantly.

Other information

The traceback:

[ 3187.113634] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 3187.127232] IP:           (null)
[ 3187.130072] PGD 0 P4D 0 
[ 3187.132790] Oops: 0010 [#1] SMP PTI
[ 3187.135579] Modules linked in: veth xt_nat xt_recent ipt_REJECT nf_reject_ipv4 xt_mark xt_comment ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat xt_addrtype iptable_filter xt_conntrack br_netfilter bridge stp llc vport_vxlan vxlan ip6_udp_tunnel udp_tunnel openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack libcrc32c crc32c_generic overlay nls_ascii nls_cp437 vfat fat mousedev evdev virtio_balloon psmouse i2c_piix4 i2c_core button sch_fq_codel ext4 crc16 mbcache jbd2 fscrypto dm_verity dm_bufio virtio_console virtio_blk uhci_hcd crc32c_intel 8139too ata_piix aesni_intel aes_x86_64 crypto_simd libata ehci_pci cryptd ehci_hcd glue_helper scsi_mod virtio_pci virtio_ring virtio
[ 3187.152444]  usbcore usb_common 8139cp mii qemu_fw_cfg dm_mirror dm_region_hash dm_log dm_mod dax
[ 3187.155193] CPU: 0 PID: 828 Comm: handler2 Not tainted 4.14.24-coreos #1
[ 3187.157113] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[ 3187.160340] task: ffff9bca82850000 task.stack: ffffa98080454000
[ 3187.162307] RIP: 0010:          (null)
[ 3187.163972] RSP: 0018:ffffa98080457738 EFLAGS: 00010286
[ 3187.166516] RAX: 0000000000000000 RBX: ffff9bca82ab73a8 RCX: 00000000000005aa
[ 3187.168583] RDX: ffff9bca82ab7700 RSI: 0000000000000000 RDI: ffff9bca82ab7300
[ 3187.170548] RBP: ffffa98080457820 R08: 0000000000000006 R09: ffff9bcaba1cf300
[ 3187.172544] R10: 0000000000000002 R11: 0000000000000000 R12: ffff9bca82ab7700
[ 3187.174510] R13: ffff9bcab6a32c00 R14: ffff9bcab6a32c00 R15: ffff9bcab80d2000
[ 3187.176484] FS:  00007f0b95c12700(0000) GS:ffff9bcabfc00000(0000) knlGS:0000000000000000
[ 3187.179044] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3187.183692] CR2: 0000000000000000 CR3: 0000000075a3e003 CR4: 00000000003606f0
[ 3187.185690] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3187.187634] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3187.189599] Call Trace:
[ 3187.190635]  ? vxlan_dev_create+0x9d0/0x2d2d [vxlan]
[ 3187.192165]  ? vxlan_dev_create+0x2164/0x2d2d [vxlan]
[ 3187.193774]  ? vxlan_dev_create+0x2164/0x2d2d [vxlan]
[ 3187.195253]  ? dev_hard_start_xmit+0xa1/0x200
[ 3187.197008]  ? vxlan_dev_create+0x1f60/0x2d2d [vxlan]
[ 3187.199381]  ? dev_hard_start_xmit+0xa1/0x200
[ 3187.201510]  ? __dev_queue_xmit+0x688/0x7c0
[ 3187.202913]  ? 0xffffffffc03c6d34
[ 3187.204154]  ? __dev_queue_xmit+0x7c0/0x7c0
[ 3187.205521]  ? 0xffffffffc03c6d34
[ 3187.206680]  ? ovs_match_init+0x82a/0xd10 [openvswitch]
[ 3187.208464]  ? __kmalloc+0x191/0x210
[ 3187.209941]  ? ovs_execute_actions+0x48/0x110 [openvswitch]
[ 3187.211757]  ? ovs_execute_actions+0x48/0x110 [openvswitch]
[ 3187.214478]  ? action_fifos_exit+0x2e9/0x34e0 [openvswitch]
[ 3187.215895]  ? genl_family_rcv_msg+0x1e4/0x390
[ 3187.217117]  ? genl_rcv_msg+0x47/0x90
[ 3187.218199]  ? __kmalloc_node_track_caller+0x222/0x2c0
[ 3187.219499]  ? genl_family_rcv_msg+0x390/0x390
[ 3187.220688]  ? netlink_rcv_skb+0x4d/0x130
[ 3187.222159]  ? genl_rcv+0x24/0x40
[ 3187.223190]  ? netlink_unicast+0x196/0x240
[ 3187.224333]  ? netlink_sendmsg+0x2b8/0x3b0
[ 3187.225448]  ? sock_sendmsg+0x36/0x40
[ 3187.226478]  ? ___sys_sendmsg+0x2a0/0x2f0
[ 3187.227523]  ? sock_poll+0x70/0x90
[ 3187.228522]  ? ep_send_events_proc+0x86/0x1a0
[ 3187.230168]  ? ep_ptable_queue_proc+0xa0/0xa0
[ 3187.231823]  ? ep_scan_ready_list.constprop.17+0x217/0x220
[ 3187.233234]  ? ep_poll+0x1e3/0x3a0
[ 3187.234271]  ? __sys_sendmsg+0x51/0x90
[ 3187.235343]  ? __sys_sendmsg+0x51/0x90
[ 3187.236470]  ? do_syscall_64+0x67/0x120
[ 3187.237579]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 3187.239199] Code:  Bad RIP value.
[ 3187.240352] RIP:           (null) RSP: ffffa98080457738
[ 3187.241930] CR2: 0000000000000000
[ 3187.243113] ---[ end trace 5714c8771c746674 ]---
[ 3187.245624] Kernel panic - not syncing: Fatal exception in interrupt
[ 3187.248332] Kernel Offset: 0x2a000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 3187.251304] Rebooting in 10 seconds..

SpComb commented Mar 15, 2018

The same issue also affects Weave on CoreOS beta: https://gist.github.com/SpComb/bc439fcda4ff9d54105c28bfd4a44916

This is 100% reproducible on Vagrant with Weave 1.9.3, and packet.net with Weave 1.9.3 and 2.2.1. Both machines kernel panic as soon as weave establishes an active vxlan connection between two nodes... a single node configured with an unreachable peer will not panic.

I fear that promoting CoreOS 1688.3.0 to stable would cause serious damage to weave users... the resulting kernel panic also shows up as corrupted Docker images for me (files truncated to zero bytes).

SpComb commented Mar 15, 2018

CoreOS alpha 1702.1.0 on Linux 4.15.7-coreos seems fine, weave launches and works without any kernel warnings.

Seems to be something specific to the CoreOS beta 1688.3.0 Linux 4.14.24-coreos kernel. The CoreOS stable 1632.3.0 Linux 4.14.19-coreos kernel is working fine.

SpComb commented Mar 15, 2018

Repro steps with weave:

  1. wget https://github.com/weaveworks/weave/releases/download/v2.2.1/weave && chmod +x weave
  2. ./weave launch <peer IP>
  3. Wait for both peers to connect, and then use the serial console to observe the kernel panic

This is probably specific to openvswitch + vxlan, so I don't think it will happen on e.g. flannel?

BTW: be prepared for filesystem corruption on /var/lib/docker when this happens, e.g.

$ ./weave launch ...
2.2.1: Pulling from weaveworks/weave
Digest: sha256:7ffd8eb2a654d1660de56f20b73a1cd9552325dcc9ee0c64285e247ec0bc1098
Status: Image is up to date for weaveworks/weave:2.2.1
/run/torcx/bin/docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "exec: \"/home/weave/sigproxy\": stat /home/weave/sigproxy: no such file or directory": unknown.

SpComb commented Mar 15, 2018

This is probably the commit in the v4.14.24-coreos branch introducing the panic: coreos/linux@4699beb#diff-4f541554c5f8f378effc907c8f0c9115

This upstream commit pretty clearly references this kernel panic: torvalds/linux@f15ca72#diff-4f541554c5f8f378effc907c8f0c9115

Some dst_ops (e.g. md_dst_ops)) doesn't set this handler. It may result to:
"BUG: unable to handle kernel NULL pointer dereference at (null)"

Let's add a helper to check if update_pmtu is available before calling it.

squeed commented Mar 15, 2018

Good find.
It looks like the bad patch was backported to 4.14.24, but the fix is only in 4.15. We should probably request that the fix be backported to 4.14.

DorianGray commented Mar 21, 2018

DorianGray commented Mar 21, 2018

I should note, weave is used as the cluster networking layer.

SpComb commented Mar 22, 2018

I can confirm that backporting the patch is simple and does fix the issue.

BTW: the patch in the linked forum thread is not a backport of the fix from 4.15, it's a revert of the problematic commit in 4.14.

dm0- commented Mar 22, 2018

@squeed @SpComb Has the fix been requested for backport to the 4.14 branch? I didn't see it from a quick scroll through patchwork and the netdev archives yet. I could send the request to start the process, unless one of you still plans to do it.

SpComb commented Mar 28, 2018

It looks like the problematic CoreOS beta 1688.3.0 got promoted to CoreOS stable 1688.4.0 with the Linux 4.14.30 kernel still containing the buggy version of the vxlan driver: https://github.com/coreos/linux/commits/v4.14.30/drivers/net/vxlan.c

Still need to confirm this, but initial signs show that CoreOS stable nodes running weave are now kernel panicking after an update.

Ping @dm0- to escalate this - no, I don't think anyone here has requested a 4.14 kernel backport for the fix.

SpComb commented Mar 28, 2018

CoreOS stable updates for 1688.4.0 are currently paused for an unrelated reason (#2284), which also protects CoreOS stable nodes running weave from this issue for now: https://groups.google.com/forum/#!topic/coreos-user/5ihE2cKuYck

Confirmed that newly provisioned CoreOS stable nodes using the 1688.4.0 image with Linux 4.14.30 are kernel panicking once weave establishes vxlan connections:

[  175.703912] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  175.704012] IP:           (null)
[  175.704012] PGD 800000001e4ed067 P4D 800000001e4ed067 PUD 1e4ee067 PMD 0 
[  175.704012] Oops: 0010 [#1] SMP PTI
[  175.704012] Modules linked in: xt_esp drbg seqiv esp4 xfrm4_mode_transport xt_policy xt_mark iptable_mangle veth dummy vport_vxlan vxlan ip6_udp_tunnel udp_tunnel openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_defrag_ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc overlay nfsv3 nfs_acl nfs lockd grace sunrpc fscache nls_ascii nls_cp437 vfat fat mousedev evdev psmouse i2c_piix4 i2c_core button sch_fq_codel ext4 crc16 mbcache jbd2 fscrypto dm_verity dm_bufio sd_mod crc32c_intel virtio_net ata_piix libata aesni_intel aes_x86_64 crypto_simd cryptd glue_helper virtio_pci virtio_ring virtio scsi_mod
[  175.704012]  dm_mirror dm_region_hash dm_log dm_mod dax
[  175.704012] CPU: 0 PID: 1600 Comm: weaver Tainted: G        W       4.14.30-coreos #1
[  175.704012] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  175.704012] task: ffff9ed67dae0000 task.stack: ffffb5c9c09b0000
[  175.704012] RIP: 0010:          (null)
[  175.704012] RSP: 0018:ffffb5c9c09b3828 EFLAGS: 00010286
[  175.704012] RAX: 0000000000000000 RBX: ffff9ed679e03ca8 RCX: 00000000000005aa
[  175.704012] RDX: ffff9ed679e03d00 RSI: 0000000000000000 RDI: ffff9ed679e03c00
[  175.704012] RBP: ffffb5c9c09b3910 R08: ffffb5c9c09b36f4 R09: 0000000000000000
[  175.704012] R10: 00000000c1a417d4 R11: 0000000020bea60d R12: ffff9ed679e03d00
[  175.704012] R13: ffff9ed679e03b00 R14: ffff9ed679e03b00 R15: ffff9ed677c68000
[  175.704012] FS:  00007f06661ee700(0000) GS:ffff9ed67fc00000(0000) knlGS:0000000000000000
[  175.704012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  175.704012] CR2: 0000000000000000 CR3: 000000003dad0005 CR4: 00000000000606f0
[  175.704012] Call Trace:
[  175.704012]  ? vxlan_dev_create+0x9d0/0x2d2d [vxlan]
[  175.704012]  ? vxlan_dev_create+0x2164/0x2d2d [vxlan]
[  175.704012]  ? vxlan_dev_create+0x2164/0x2d2d [vxlan]
[  175.704012]  ? dev_hard_start_xmit+0xa1/0x200
[  175.704012]  ? vxlan_dev_create+0x1f60/0x2d2d [vxlan]
[  175.704012]  ? dev_hard_start_xmit+0xa1/0x200
[  175.704012]  ? __dev_queue_xmit+0x678/0x7c0
[  175.704012]  ? 0xffffffffc04ffd34
[  175.704012]  ? __dev_queue_xmit+0x7c0/0x7c0
[  175.704012]  ? 0xffffffffc04ffd34
[  175.704012]  ? ovs_match_init+0x82a/0xd10 [openvswitch]
[  175.704012]  ? __kmalloc+0x191/0x210
[  175.704012]  ? ovs_execute_actions+0x48/0x110 [openvswitch]
[  175.704012]  ? ovs_execute_actions+0x48/0x110 [openvswitch]
[  175.704012]  ? action_fifos_exit+0x2e9/0x34e0 [openvswitch]
[  175.704012]  ? genl_family_rcv_msg+0x1e4/0x390
[  175.704012]  ? tcp_transmit_skb+0x545/0x9b0
[  175.704012]  ? genl_rcv_msg+0x47/0x90
[  175.704012]  ? __kmalloc_node_track_caller+0x222/0x2c0
[  175.704012]  ? genl_family_rcv_msg+0x390/0x390
[  175.704012]  ? netlink_rcv_skb+0x4d/0x130
[  175.704012]  ? genl_rcv+0x24/0x40
[  175.704012]  ? netlink_unicast+0x196/0x240
[  175.704012]  ? netlink_sendmsg+0x2b8/0x3b0
[  175.704012]  ? sock_sendmsg+0x36/0x40
[  175.704012]  ? SYSC_sendto+0x10e/0x140
[  175.704012]  ? __audit_syscall_entry+0xbc/0x110
[  175.704012]  ? syscall_trace_enter+0x1df/0x2e0
[  175.704012]  ? __do_page_fault+0x266/0x4c0
[  175.704012]  ? do_syscall_64+0x67/0x120
[  175.704012]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  175.704012] Code:  Bad RIP value.
[  175.704012] RIP:           (null) RSP: ffffb5c9c09b3828
[  175.704012] CR2: 0000000000000000
[  175.704012] ---[ end trace 7fa82df8d653d662 ]---
[  175.704012] Kernel panic - not syncing: Fatal exception in interrupt
[  175.704012] Kernel Offset: 0x1000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

bgilbert commented Mar 28, 2018

@SpComb We need to issue a new stable release anyway, and will make sure it includes a fix. Thanks for your persistence.

btalbot commented Mar 28, 2018

Wow, so a cluster-crashing kernel bug reported on the CoreOS beta channel 2 weeks ago still made it into a stable-channel release, but just happened to not get rolled out due to some other (even more serious?) bug?

bgilbert commented Mar 28, 2018

This should be fixed in beta 1722.2.0 and stable 1688.5.0, due shortly.

@bgilbert bgilbert closed this Mar 28, 2018

bgilbert commented Mar 28, 2018

We're carrying coreos/linux@f5f2102 to fix this and have requested a backport.

SpComb commented Mar 29, 2018

Will the beta channel 1722.2.0 and stable channel 1688.5.0 releases happen simultaneously, or will the fixed 4.14.30 kernel be available in the beta channel for testing before getting released as a stable update?

I have not verified that the commit in 4.15 fixes the crash with weave in 4.14, although I'm hopeful it will... unfortunately I don't know how to test the 4.14.30-coreos kernel branch locally. Were you able to repro the kernel panic on 4.14.24-30 with weave, and verify that it was fixed in the new 4.14.30-r1 kernel?

fabiorauber commented Mar 29, 2018

I'm having similar issues with Rancher in a cluster with 1000+ containers. When they are created and removed rapidly, CoreOS 1632.3.0 panics with Fatal exception in interrupt. I couldn't get the full core dump yet. Kernel 4.14.19.

bgilbert commented Mar 29, 2018

@SpComb The releases will likely be simultaneous. I was able to repro with 4.14.30 and verified that the repro failed on 4.14.30-r1. (Your instructions made that process trivial; thanks!)

@fabiorauber The broken patch was introduced in 4.14.24, so you may be seeing a different problem. Please file a new bug if the upcoming releases don't fix your issue.

bgilbert commented Mar 30, 2018

It turns out that 1688.5.0 was broken and not releasable, so this issue remains unresolved in stable. Beta 1722.2.0 will be rolling out shortly.

@bgilbert bgilbert reopened this Mar 30, 2018

bgilbert commented Apr 3, 2018

This issue should be fixed in stable 1688.5.3, which is rolling out now.

@bgilbert bgilbert closed this Apr 3, 2018

SpComb commented Apr 4, 2018

Confirmed that CoreOS beta 1722.2.0 with Linux 4.14.30-coreos-r1 and CoreOS stable 1688.5.3 with Linux 4.14.32-coreos both fix this issue, weave now works and is no longer panicking the kernel.

Thanks for the backport. I hope it eventually finds its way via netdev into the 4.14 stable tree as well, or this issue may start showing up in other distros too.

lnehrin commented Apr 24, 2018

Damn. Amazon released 4.14.33-51.34.amzn1 yesterday or so as part of their "Amazon Linux version 2018.03" which definitely appears to have this issue. Hoping they also backport or patch.... otherwise weave panics the OS causing a continuous "groundhog day" reboot cycle. Had to terminate my failed instances and re-create from images of ECS instances that I hadn't yet broken by patching.
