
java invoked oom-killer #22788

Closed
humbapa opened this issue Jan 25, 2017 · 8 comments

@humbapa commented Jan 25, 2017

Elasticsearch version: 5.1.2 (and 2.4.4)

Plugins installed: []

JVM version: Oracle JDK 1.8.0_121 (and OpenJDK 1.8.0_111)

OS version: Ubuntu 16.04

Description of the problem including expected versus actual behavior:
On 2.4.1 the cluster was running without a problem. After upgrading to 2.4.4 I started to see "out of memory" errors. This happens randomly on all 8 data nodes, which all have the same configuration and storage size.
As I was already prepared to upgrade to 5.x, I upgraded all nodes to 5.1.2. Unfortunately, the nodes keep dying once I start indexing data again.
I then switched from OpenJDK to the Oracle JDK and also tried to re-enable the netty recycler (-Dio.netty.recycler.maxCapacityPerThread), but I still get those "out of memory" errors.
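
For reference, this is roughly how the property can be passed. It is only a sketch: it assumes the stock 5.x config/jvm.options (which disables the recycler with -Dio.netty.recycler.maxCapacityPerThread=0), and the capacity value is purely illustrative, not a recommendation.

  # Sketch: one way to re-enable the Netty recycler for a single test run.
  # A later -D overrides the value from config/jvm.options, so appending it
  # through ES_JAVA_OPTS works; 32768 is only an illustrative value.
  export ES_JAVA_OPTS="-Dio.netty.recycler.maxCapacityPerThread=32768"
  ./bin/elasticsearch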

Steps to reproduce:

  1. Once the cluster reaches green status, just wait a few hours and a random node will invoke the oom-killer (see the monitoring sketch below)
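
While waiting, something like the following can be used to confirm the running kernel and catch the event live (just a sketch; the log paths assume a default Ubuntu 16.04 syslog/journald setup):

  # Sketch: confirm the running kernel and watch for oom-killer invocations.
  uname -r                                   # e.g. 4.4.0-59-generic

  # Follow kernel messages live and flag any oom-killer event:
  journalctl -k -f | grep --line-buffered -i 'invoked oom-killer'

  # Or search the existing logs after the fact:
  grep -i 'invoked oom-killer' /var/log/syslog /var/log/kern.log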

Provide logs (if relevant):

I only get the following kernel error:

Jan 25 15:25:20 es14 kernel: java invoked oom-killer: gfp_mask=0x26142c0, order=2, oom_score_adj=0
Jan 25 15:25:20 es14 kernel: java cpuset=/ mems_allowed=0
Jan 25 15:25:20 es14 kernel: CPU: 5 PID: 22114 Comm: java Not tainted 4.4.0-59-generic #80-Ubuntu
Jan 25 15:25:20 es14 kernel: Hardware name: Supermicro SYS-5039MS-H8TRF/X11SSD-F, BIOS 1.0 01/11/2016
Jan 25 15:25:20 es14 kernel:  0000000000000286 00000000ea5159af ffff88011bdfb8a8 ffffffff813f7583
Jan 25 15:25:20 es14 kernel:  ffff88011bdfba80 ffff88011020ad00 ffff88011bdfb918 ffffffff8120ad5e
Jan 25 15:25:20 es14 kernel:  ffff88107795a870 ffff88107795a860 ffffea0016f32480 0000000100000001
Jan 25 15:25:20 es14 kernel: Call Trace:
Jan 25 15:25:20 es14 kernel:  [<ffffffff813f7583>] dump_stack+0x63/0x90
Jan 25 15:25:20 es14 kernel:  [<ffffffff8120ad5e>] dump_header+0x5a/0x1c5
Jan 25 15:25:20 es14 kernel:  [<ffffffff81192722>] oom_kill_process+0x202/0x3c0
Jan 25 15:25:20 es14 kernel:  [<ffffffff81192b49>] out_of_memory+0x219/0x460
Jan 25 15:25:20 es14 kernel:  [<ffffffff81198abd>] __alloc_pages_slowpath.constprop.88+0x8fd/0xa70
Jan 25 15:25:20 es14 kernel:  [<ffffffff81198eb6>] __alloc_pages_nodemask+0x286/0x2a0
Jan 25 15:25:20 es14 kernel:  [<ffffffff81198f6b>] alloc_kmem_pages_node+0x4b/0xc0
Jan 25 15:25:20 es14 kernel:  [<ffffffff811e973c>] kmalloc_large_node+0x2c/0x60
Jan 25 15:25:20 es14 kernel:  [<ffffffff811f0df2>] __kmalloc_node_track_caller+0x262/0x310
Jan 25 15:25:20 es14 kernel:  [<ffffffff81718637>] ? __alloc_skb+0x87/0x1f0
Jan 25 15:25:20 es14 kernel:  [<ffffffff81717101>] __kmalloc_reserve.isra.33+0x31/0x90
Jan 25 15:25:20 es14 kernel:  [<ffffffff8171860b>] ? __alloc_skb+0x5b/0x1f0
Jan 25 15:25:20 es14 kernel:  [<ffffffff81718637>] __alloc_skb+0x87/0x1f0
Jan 25 15:25:20 es14 kernel:  [<ffffffff8177b916>] sk_stream_alloc_skb+0x56/0x1b0
Jan 25 15:25:20 es14 kernel:  [<ffffffff8177c874>] tcp_sendmsg+0x824/0xb60
Jan 25 15:25:20 es14 kernel:  [<ffffffff817a80f5>] inet_sendmsg+0x65/0xa0
Jan 25 15:25:20 es14 kernel:  [<ffffffff8170fae8>] sock_sendmsg+0x38/0x50
Jan 25 15:25:20 es14 kernel:  [<ffffffff8170fb85>] sock_write_iter+0x85/0xf0
Jan 25 15:25:20 es14 kernel:  [<ffffffff8120e11b>] new_sync_write+0x9b/0xe0
Jan 25 15:25:20 es14 kernel:  [<ffffffff8120e186>] __vfs_write+0x26/0x40
Jan 25 15:25:20 es14 kernel:  [<ffffffff8120eb09>] vfs_write+0xa9/0x1a0
Jan 25 15:25:20 es14 kernel:  [<ffffffff8120f7c5>] SyS_write+0x55/0xc0
Jan 25 15:25:20 es14 kernel:  [<ffffffff818384f2>] entry_SYSCALL_64_fastpath+0x16/0x71
Jan 25 15:25:20 es14 kernel: Mem-Info:
Jan 25 15:25:20 es14 kernel: active_anon:431751 inactive_anon:4743 isolated_anon:0
Jan 25 15:25:20 es14 kernel: Node 0 DMA free:15876kB min:16kB low:20kB high:24kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15960kB managed:15876kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jan 25 15:25:20 es14 kernel: lowmem_reserve[]: 0 1869 64201 64201 64201
Jan 25 15:25:20 es14 kernel: Node 0 DMA32 free:251824kB min:1964kB low:2452kB high:2944kB active_anon:5260kB inactive_anon:80kB active_file:0kB inactive_file:0kB unevictable:1683784kB isolated(anon):0kB isolated(file):0kB present:2033712kB managed:1953092kB mlocked:1683784kB dirty:0kB writeback:0kB mapped:824kB shmem:580kB slab_reclaimable:5536kB slab_unreclaimable:1208kB kernel_stack:128kB pagetables:4076kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jan 25 15:25:20 es14 kernel: lowmem_reserve[]: 0 0 62332 62332 62332
Jan 25 15:25:20 es14 kernel: Node 0 Normal free:107828kB min:65600kB low:82000kB high:98400kB active_anon:1721744kB inactive_anon:18892kB active_file:15876236kB inactive_file:15954636kB unevictable:28564300kB isolated(anon):0kB isolated(file):0kB present:64876544kB managed:63828168kB mlocked:28564300kB dirty:49116kB writeback:0kB mapped:3011992kB shmem:33152kB slab_reclaimable:1168120kB slab_unreclaimable:49304kB kernel_stack:6960kB pagetables:197244kB unstable:0kB bounce:0kB free_pcp:792kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jan 25 15:25:20 es14 kernel: lowmem_reserve[]: 0 0 0 0 0
Jan 25 15:25:20 es14 kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15876kB
Jan 25 15:25:20 es14 kernel: Node 0 DMA32: 140*4kB (UMEH) 92*8kB (UMEH) 64*16kB (UMEH) 53*32kB (UME) 22*64kB (UMEH) 19*128kB (UMH) 23*256kB (UME) 11*512kB (UM) 9*1024kB (UM) 3*2048kB (UME) 53*4096kB (UM) = 251824kB
Jan 25 15:25:20 es14 kernel: Node 0 Normal: 25762*4kB (UME) 586*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 107736kB
Jan 25 15:25:20 es14 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jan 25 15:25:20 es14 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jan 25 15:25:20 es14 kernel: 7971730 total pagecache pages
Jan 25 15:25:20 es14 kernel: 0 pages in swap cache
Jan 25 15:25:20 es14 kernel: Swap cache stats: add 0, delete 0, find 0/0
Jan 25 15:25:20 es14 kernel: Free swap  = 0kB
Jan 25 15:25:20 es14 kernel: Total swap = 0kB
Jan 25 15:25:20 es14 kernel: 16731554 pages RAM
Jan 25 15:25:20 es14 kernel: 0 pages HighMem/MovableOnly
Jan 25 15:25:20 es14 kernel: 282270 pages reserved
Jan 25 15:25:20 es14 kernel: 0 pages cma reserved
Jan 25 15:25:20 es14 kernel: 0 pages hwpoisoned
Jan 25 15:25:20 es14 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Jan 25 15:25:20 es14 kernel: [  481]     0   481    22241    12371      45       3        0             0 systemd-journal
Jan 25 15:25:20 es14 kernel: [  534]     0   534     9849     1051      20       4        0         -1000 systemd-udevd
Jan 25 15:25:20 es14 kernel: [ 1381]   108  1381    27684     1422      56       3        0          -900 dbus-daemon
Jan 25 15:25:20 es14 kernel: [ 1423]     0  1423     4929      500      15       3        0             0 atd
Jan 25 15:25:20 es14 kernel: [ 1438]     0  1438     5098      564      13       3        0             0 systemd-logind
Jan 25 15:25:20 es14 kernel: [ 1449]     0  1449     1100      293       9       3        0             0 acpid
Jan 25 15:25:20 es14 kernel: [ 1458]     0  1458     5350      712      15       3        0             0 cron
Jan 25 15:25:20 es14 kernel: [ 1463]   104  1463    80522     1528      57       4        0             0 rsyslogd
Jan 25 15:25:20 es14 kernel: [ 1554]     0  1554   176004     2598      68       3        0             0 nscd
Jan 25 15:25:20 es14 kernel: [ 2118]     0  2118    14858     1212      32       3        0         -1000 sshd
Jan 25 15:25:20 es14 kernel: [ 2137]     0  2137     1606      170       9       3        0             0 agetty
Jan 25 15:25:20 es14 kernel: [ 2150] 10023  2150     3469      924      15       3        0          -500 nrpe
Jan 25 15:25:20 es14 kernel: [ 2157]   110  2157    26033      884      23       3        0             0 ntpd
Jan 25 15:25:20 es14 kernel: [ 2265]     0  2265    15413     1080      22       3        0             0 master
Jan 25 15:25:20 es14 kernel: [ 2267]   111  2267    15440     1090      21       3        0             0 qmgr
Jan 25 15:25:20 es14 kernel: [22041]   112 22041 166246848  8386734   49054     637        0             0 java
Jan 25 15:25:20 es14 kernel: [24878]     0 24878  2526142   341234     820       9        0             0 java
Jan 25 15:25:20 es14 kernel: [28283]   111 28283    15400      655      21       3        0             0 pickup
Jan 25 15:25:20 es14 kernel: Out of memory: Kill process 22041 (java) score 512 or sacrifice child
Jan 25 15:25:20 es14 kernel: Killed process 22041 (java) total-vm:664987392kB, anon-rss:30648528kB, file-rss:2898356kB

Logstash is also running on this node, but the error occurs even when it is stopped.

@MiLk commented Jan 27, 2017

Same issue here.

It's probably related to the kernel.
See https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1647400 and https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1655842

We are reverting to 4.4.0-57-generic.
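
For anyone following along, here is a rough sketch of the downgrade on Ubuntu 16.04; the package names assume the stock xenial kernel packages, so double-check them before running:

  # Sketch: install the older kernel alongside the current one and boot into it.
  sudo apt-get update
  sudo apt-get install linux-image-4.4.0-57-generic linux-image-extra-4.4.0-57-generic

  # Reboot and select 4.4.0-57 under "Advanced options for Ubuntu" in GRUB,
  # then confirm the running kernel:
  uname -r

  # Optionally hold the kernel meta-package until a fixed kernel is available:
  sudo apt-mark hold linux-image-generic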

@humbapa (Author) commented Jan 27, 2017

Thanks for the hint :-)

I'm also going back to 4.4.0-57 from 4.4.0-59 and will report on Monday whether I still see those OOM issues.

@humbapa (Author) commented Jan 30, 2017

With the Ubuntu 4.4.0-57 kernel I no longer have those "Out of memory" issues.

Thanks for your help

@humbapa closed this Jan 30, 2017

@simenflatby commented Mar 24, 2017

Has anyone experienced similar behavior with the following versions?

  • Ubuntu 16.04.1 LTS
  • elasticsearch 2.4.0
  • oracle-java8-installer 8u101+8u101arm-1~webupd8~2
  • kernel 4.4.0-67-generic

@ywilkof commented May 8, 2017

Just to confirm, we had a similar issue in our production environment running ES 2.4.4 on Ubuntu 16 with kernel 4.4.0-59. Linux was sacrificing the cluster nodes randomly due to OOM.
As suggested in this thread, we downgraded half of the machines to 4.4.0-57 and upgraded the other half to 4.4.0-77. Since then the issue has not recurred (we have been running this setup for a few days now), whereas before it would happen multiple times a day. Feeling lucky I found this thread ;)

@gpstathis (Contributor) commented May 15, 2017

Confirming this is happening with ES 2.3.3 on Ubuntu 16.04.1 LTS with the 4.4.0-71-generic kernel.

@jannemann commented Aug 1, 2017

We have the same problem with ES 1.7.4 on Ubuntu 14.04.1 LTS, the 4.4.0-83-generic kernel, and Oracle JVM 1.8.0_131.

@rtkmhart commented Aug 9, 2017

Same here with Elasticsearch 2.4.5, Oracle JVM 1.8.0_131, and Ubuntu 16.04 LTS with the linux-aws kernel 4.4.0-1028.37, which we believe corresponds to kernel 4.4.0.89-112.
