
Adjust JIT memory usage based on cgroup limits #1371

Closed
ashu-mehra opened this issue Mar 7, 2018 · 21 comments

Comments

@ashu-mehra
Contributor

Port library APIs for getting the memory limit imposed by a cgroup are now available in OMR, and OpenJ9 has an option to enable cgroup awareness, which causes the GC to size its heap properly when -Xmx is not specified.
In addition to the GC, @mpirvu suggested the JIT should also use cgroup memory limits to size its memory usage.

Comment from Marius:

I have another use case: the JIT dynamically sizes its memory usage limits based on the amount of free physical memory (trying to avoid an OOM case). We get that information from omrsysinfo_get_memory_info, which populates a J9MemoryInfo struct, and we compute the sum of memInfo->availPhysical, memInfo->cached and memInfo->buffered. These values have meaning for the entire system, but I would like to take into account the limits imposed by the cgroups as well.
Without knowing what's available, I was thinking something along these lines:
(1) Get the physical memory used by the JVM process and the value from omrsysinfo_get_physical_memory, which takes cgroup limits into account. The difference is the physical memory the JVM is still allowed to allocate, so we need to adjust the JIT memory usage limit to always keep some safe reserve.
(2) The logic above is not enough because other processes or containers may use some of the physical memory that this JVM could allocate. Thus I have to use omrsysinfo_get_memory_info too and then take the minimum of availPhysical and the value computed at step (1) (see the sketch below).
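A minimal sketch of this two-step computation, for illustration only: MemSnapshot, jitAvailableMemory and SAFE_RESERVE are hypothetical stand-ins, not real OMR port library types or signatures; only the field names availPhysical, cached and buffered mirror the J9MemoryInfo fields quoted above.

#include <stdint.h>

/* Stand-in for the system-wide stats omrsysinfo_get_memory_info() reports. */
typedef struct MemSnapshot {
    uint64_t availPhysical; /* free physical memory */
    uint64_t cached;        /* page-cache memory, reclaimable */
    uint64_t buffered;      /* buffer memory, reclaimable */
} MemSnapshot;

#define SAFE_RESERVE ((uint64_t)64 * 1024 * 1024) /* example head room */

static uint64_t min_u64(uint64_t a, uint64_t b) { return (a < b) ? a : b; }

/*
 * cgroupLimit - cgroup-aware total, as from omrsysinfo_get_physical_memory()
 * jvmRss      - physical memory the JVM process currently uses
 * sys         - system-wide stats, as from omrsysinfo_get_memory_info()
 */
uint64_t jitAvailableMemory(uint64_t cgroupLimit, uint64_t jvmRss, const MemSnapshot *sys)
{
    /* (1) how much more the cgroup still allows this JVM to allocate */
    uint64_t cgroupRoom = (cgroupLimit > jvmRss) ? (cgroupLimit - jvmRss) : 0;

    /* (2) other processes or containers may consume host memory first, so
     * cap by what the whole system can still provide */
    uint64_t systemRoom = sys->availPhysical + sys->cached + sys->buffered;

    uint64_t room = min_u64(cgroupRoom, systemRoom);
    return (room > SAFE_RESERVE) ? (room - SAFE_RESERVE) : 0;
}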

It's interesting that omrsysinfo_get_memory_info() also determines total memory, as omrsysinfo_get_physical_memory() does, but they use different mechanisms.

Given that we want to update the port library APIs to use cgroup limits transparently, I think omrsysinfo_get_memory_info() should be updated along the same lines as omrsysinfo_get_physical_memory() and should transparently return values based on cgroup limits, if available.
If we do that, then all the memory stats that omrsysinfo_get_memory_info() calculates would be in the context of the cgroup, and not of the whole system, when the JVM is in a cgroup. After this is done, the JIT can continue using omrsysinfo_get_memory_info() in the same way it is used today.

However, as Marius mentioned, even when we are running in a cgroup, we need to take the system-wide values into account to get the correct amount of available memory. This is because a cgroup does not provide any mechanism to reserve memory; it just puts an upper limit on the memory usage of the cgroup. This means other processes on the host (or other cgroups) can eat into this cgroup's memory.
But if we update omrsysinfo_get_memory_info() to return cgroup stats as per the current design, then there won't be any API that returns the system-wide available memory.
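To make the "transparent" behavior concrete, here is a minimal sketch of a cgroup-aware total-memory query on Linux. It assumes cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory; the real port library code resolves the mount point and the process's cgroup path rather than hardcoding them, and totalPhysicalMemory/readU64 are hypothetical names.

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read a single integer from a cgroup control file, or return a fallback. */
static uint64_t readU64(const char *path, uint64_t fallback)
{
    unsigned long long v = 0;
    FILE *f = fopen(path, "r");
    if (NULL == f) {
        return fallback;
    }
    if (1 != fscanf(f, "%llu", &v)) {
        v = (unsigned long long)fallback;
    }
    fclose(f);
    return (uint64_t)v;
}

/* Cgroup-aware total physical memory: inside a container the cgroup limit
 * is the effective "total"; otherwise use the host RAM size. */
uint64_t totalPhysicalMemory(int inCgroup)
{
    uint64_t hostTotal = (uint64_t)sysconf(_SC_PHYS_PAGES) * (uint64_t)sysconf(_SC_PAGE_SIZE);
    if (inCgroup) {
        uint64_t limit = readU64("/sys/fs/cgroup/memory/memory.limit_in_bytes", hostTotal);
        /* an unset cgroup limit reads as a huge value; ignore it then */
        return (limit < hostTotal) ? limit : hostTotal;
    }
    return hostTotal;
}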

Given all these things, my proposal is:

  1. Update omrsysinfo_get_memory_info() to use the memory stats exposed by the memory controller of the cgroup.
  2. Add a new port library API (something like omrsysinfo_get_available_memory(), sketched below) to return available memory. This API would encapsulate the logic that Marius mentioned, that is, return the minimum of the available memory in the system and the available memory in the cgroup.
  3. Update the JIT to use the new API mentioned in (2) instead of omrsysinfo_get_memory_info().
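A sketch of what the proposed API in (2) might look like. The name omrsysinfo_get_available_memory is the one proposed above but does not exist yet, and the two static helpers are placeholders for the cgroup and /proc queries, not real OMR functions.

#include <stdint.h>

struct OMRPortLibrary; /* opaque here; the real definition lives in omrport.h */

/* Placeholder: cgroup limit minus the cgroup's current usage. */
static uint64_t cgroupAvailableMemory(struct OMRPortLibrary *portLibrary)
{
    (void)portLibrary;
    return 0;
}

/* Placeholder: availPhysical + cached + buffered, read from /proc. */
static uint64_t systemAvailableMemory(struct OMRPortLibrary *portLibrary)
{
    (void)portLibrary;
    return 0;
}

/*
 * Proposed API: encapsulate the min-of-the-two logic so that callers
 * like the JIT don't have to fetch and compare both values themselves.
 */
uint64_t
omrsysinfo_get_available_memory(struct OMRPortLibrary *portLibrary)
{
    uint64_t inCgroup = cgroupAvailableMemory(portLibrary);
    uint64_t onHost = systemAvailableMemory(portLibrary);
    return (inCgroup < onHost) ? inCgroup : onHost;
}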

fyi - @DanHeidinga @charliegracie @pshipton

@DanHeidinga
Member

Didn't we discuss a similar API point in relation to the GC? I vaguely recall something about having a single API and passing in an enum to indicate if it should be the cgroup limit or the system limit.

@ashu-mehra
Contributor Author

@DanHeidinga I remember that discussion. It happened when we wanted only the GC, and no other component, to use cgroup limits when a specific option was specified. The GC would have to decide which limit, cgroup or system, to use and pass an enum to the port library API.
But there was another discussion where we decided to change the existing APIs such that they would return cgroup limits transparently. My proposal is based on that discussion.
Moreover, cgroup limits, when enabled via -XX:+UseContainerSupport, now apply to all cgroup subsystems and to all components of the JVM, unlike the earlier situation where we wanted only the GC to use cgroup limits.

@ashu-mehra
Contributor Author

@DanHeidinga your thoughts?

@DanHeidinga
Member

I'm hoping to give this more thought this week. I haven't had a chance to sit down and really think about the implications of this yet.

@DanHeidinga
Member

If there is only going to be one place where we need the results from both, I suggest that location should make both calls.

An API should only be added when there is a reasonable expectation that multiple callers will need the info.

@ashu-mehra
Contributor Author

@DanHeidinga I believe you are referring to omrsysinfo_get_available_memory(). If we don't add this API, we would need a mechanism to get memory stats from /proc and from the cgroup, so that the JIT can use the minimum of the two.
Given that we are already updating omrsysinfo_get_memory_info to return cgroup stats transparently when running in a container (eclipse/omr#2430), we don't have any API that returns memory stats from /proc when running in a container.
Moreover, I believe this is the kind of detail (i.e. fetching the memory stats for the host and the container and picking the minimum of the two) that the JIT should not have to bother about, isn't it? Better to hide it in the port library?

@mpirvu
Contributor

mpirvu commented Jun 12, 2018

I agree with Ashu's comments above:

  1. The best place for the logic that determines what the JVM can afford to allocate is the portlib. We plan to use that for JIT scratch memory and quite possibly for the code cache / data cache, so the JIT will have to define a new routine, but why not have it in the portlib so that other components can use it too if they want to.
  2. If we make omrsysinfo_get_memory_info aware of containers by default, as planned, then we still need another API for getting the available physical memory at the machine level (from /proc). This new API should definitely be in the portlib.

@DanHeidinga
Member

If we don't add this API, we would need a mechanism to get memory stats from /proc and from the cgroup, so that the JIT can use the minimum of the two.

So we'd have to add a new API either way? That changes my answer.

One of my concerns with the approach is that OMR already has 3 memory-related APIs that appear to have a high degree of overlap:

  • omrsysinfo_get_memory_info
  • omrsysinfo_get_addressable_physical_memory
  • omrsysinfo_get_physical_memory

Adding a 4th won't help this situation.

What about extending the J9MemoryInfo structure to include some additional fields that hold the "raw" / non-cgroup memory values?
https://github.com/eclipse/omr/blob/e2baea03b8cc06f247c27a0a71b42b40379a0cf6/include_core/omrport.h#L559-L569

In the non-cgroup case, these new fields would be assigned the same values as the original fields. They would potentially differ only when cgroup limits are enabled.
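A sketch of that extension, with illustrative field names (the fields actually added by eclipse/omr#2430 may be named differently); only the first four fields below correspond to existing J9MemoryInfo members:

#include <stdint.h>

/* Sketch of the proposed extension: the existing fields become
 * cgroup-scoped when container support is on, and "host" twins carry
 * the raw machine-wide values read from /proc. */
typedef struct J9MemoryInfoSketch {
    /* existing fields: cgroup-scoped inside a container */
    uint64_t totalPhysical;
    uint64_t availPhysical;
    uint64_t cached;
    uint64_t buffered;

    /* proposed fields: always machine-wide */
    uint64_t hostAvailPhysical;
    uint64_t hostCached;
    uint64_t hostBuffered;
} J9MemoryInfoSketch;

/* In the non-cgroup case the pairs coincide: populate both from one source. */
void fillNonCgroup(J9MemoryInfoSketch *info,
                   uint64_t avail, uint64_t cached, uint64_t buffered)
{
    info->availPhysical = info->hostAvailPhysical = avail;
    info->cached        = info->hostCached        = cached;
    info->buffered      = info->hostBuffered      = buffered;
}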

@mpirvu
Contributor

mpirvu commented Jun 13, 2018

Your proposal about extending the J9MemoryInfo sounds good to me.

@ashu-mehra
Contributor Author

@DanHeidinga @mpirvu I have added to J9MemoryInfo only those fields we would need to calculate available memory on the host. These fields are added only for the Linux platform. Please see the second commit in eclipse/omr#2430.
I am assuming that with this change we don't need the new API omrsysinfo_get_available_memory now, right?

@mpirvu
Contributor

mpirvu commented Jun 14, 2018

Why the limitation on Linux?

I am assuming that with this change we don't need the new API omrsysinfo_get_available_memory now, right?

Correct. If I have the values for the entire machine, I can implement the required logic in a JIT function. We can export that later if there are other consumers.

@ashu-mehra
Contributor Author

Why the limitation on Linux?

Because cgroups are a Linux thing, and that's where we would need to store the host memory stats separately. On other systems, these new fields are not really required.

@ashu-mehra
Contributor Author

Why the limitation on Linux?

Dan has also asked for it to be platform neutral. I will update the code.

@ashu-mehra
Contributor Author

@mpirvu fyi - PR eclipse/omr#2430 is merged now.

@mpirvu
Contributor

mpirvu commented Jul 18, 2018

Thanks, I am working on it.

@huntc

huntc commented Dec 8, 2018

We have been using -XX:+UseContainerSupport and are seeing the Linux OOM killer kill our container because it exceeds the memory quota. My feeling is that our problem is related to this issue.

Is there any workaround while this issue continues to be debated? Thanks.

@mpirvu
Contributor

mpirvu commented Dec 9, 2018

This issue was fixed by PR #2546.

Apart from the Java heap and the memory used by the JIT compiler, the JVM uses memory for other components as well: classes and other VM data structures, GC data structures, and JIT persistent/runtime data structures. Any of these could have pushed the JVM over the container limit.
To diagnose this issue, my suggestion is to keep the heap limit at the same value it uses now, increase the container memory limit to avoid the OOM, and generate a "javacore" by sending "kill -3" to the JVM process at the point where you guess the OOM would have happened. Then we can inspect the javacore to see which components use a lot of memory.

@mpirvu mpirvu closed this as completed Dec 9, 2018
Container-aware JVM automation moved this from To do to Done Dec 9, 2018
@huntc

huntc commented Dec 9, 2018

Thanks @mpirvu - do you know if this fix is within the container I have?

$ docker run -it ibmcom/ibmjava:8-sfj-alpine sh
/ # java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 8.0.5.25 - pxa6480sr5fp25-20181030_01(SR5 FP25) Small Footprint)
IBM J9 VM (build 2.9, JRE 1.8.0 Linux amd64-64-Bit Compressed References 20181029_400846 (JIT enabled, AOT enabled)
OpenJ9   - c5c78da
OMR      - 3d5ac33
IBM      - 8c1bdc2)
JCL - 20181022_01 based on Oracle jdk8u191-b26

It is hard to track back...

@mpirvu
Contributor

mpirvu commented Dec 10, 2018

Yes, sr5fp25 has this fix.

@huntc

huntc commented Dec 13, 2018

Thanks for that @mpirvu.

So, given that we're using the fix that you referenced, here's a report from the Linux OOM killer when things die for us:

[Thu Dec 13 00:21:30 2018] JIT Compilation invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[Thu Dec 13 00:21:30 2018] JIT Compilation cpuset=lxc-23527-fdp.libvirt-lxc mems_allowed=0
[Thu Dec 13 00:21:30 2018] CPU: 0 PID: 23660 Comm: JIT Compilation Not tainted 4.1.49-rt52-yocto-standard #1
[Thu Dec 13 00:21:30 2018] Hardware name: Lynx Software Technologies, Inc.  , BIOS Version TRUNK (ENGINEERING) 12/20/2017
[Thu Dec 13 00:21:30 2018]  0000000000000000 ffff8800377abc98 ffffffff819b8a4f ffff880037d7e100
[Thu Dec 13 00:21:30 2018]  ffff880006c6d000 ffff8800377abd48 ffffffff8112c7f5 ffff8800377abcf8
[Thu Dec 13 00:21:30 2018]  0000000000000202 ffff8800381b3f10 ffffffff81136068 00000000000000fd
[Thu Dec 13 00:21:30 2018] Call Trace:
[Thu Dec 13 00:21:30 2018]  [<ffffffff819b8a4f>] dump_stack+0x63/0x81
[Thu Dec 13 00:21:30 2018]  [<ffffffff8112c7f5>] dump_header.isra.6+0x75/0x210
[Thu Dec 13 00:21:30 2018]  [<ffffffff81136068>] ? __page_cache_release+0x28/0x130
[Thu Dec 13 00:21:30 2018]  [<ffffffff81136b1f>] ? put_page+0x3f/0x60
[Thu Dec 13 00:21:30 2018]  [<ffffffff8112cf50>] oom_kill_process+0x1c0/0x3a0
[Thu Dec 13 00:21:30 2018]  [<ffffffff81176f3f>] ? mem_cgroup_iter+0x1df/0x430
[Thu Dec 13 00:21:30 2018]  [<ffffffff811799f9>] mem_cgroup_oom_synchronize+0x579/0x5b0
[Thu Dec 13 00:21:30 2018]  [<ffffffff81176b00>] ? mem_cgroup_can_attach+0x150/0x150
[Thu Dec 13 00:21:30 2018]  [<ffffffff8112d66f>] pagefault_out_of_memory+0x1f/0xc0
[Thu Dec 13 00:21:30 2018]  [<ffffffff8104af55>] mm_fault_error+0x75/0x160
[Thu Dec 13 00:21:30 2018]  [<ffffffff8104b460>] __do_page_fault+0x420/0x430
[Thu Dec 13 00:21:30 2018]  [<ffffffff819bb519>] ? __schedule+0x2b9/0x5f0
[Thu Dec 13 00:21:30 2018]  [<ffffffff8104b492>] do_page_fault+0x22/0x30
[Thu Dec 13 00:21:30 2018]  [<ffffffff819c0fa8>] page_fault+0x28/0x30
[Thu Dec 13 00:21:30 2018] Task in /apphosting.partition/lxc-23527-fdp.libvirt-lxc killed as a result of limit of /apphosting.partition/lxc-23527-fdp.libvirt-lxc
[Thu Dec 13 00:21:30 2018] memory: usage 524288kB, limit 524288kB, failcnt 6107
[Thu Dec 13 00:21:30 2018] memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0
[Thu Dec 13 00:21:30 2018] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[Thu Dec 13 00:21:30 2018] Memory cgroup stats for /apphosting.partition/lxc-23527-fdp.libvirt-lxc: cache:856KB rss:523432KB rss_huge:0KB mapped_file:128KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:523428KB inactive_file:440KB active_file:416KB unevictable:4KB
[Thu Dec 13 00:21:30 2018] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[Thu Dec 13 00:21:30 2018] [23527]     0 23527    19793     1756      41       3        0             0 libvirt_lxc
[Thu Dec 13 00:21:30 2018] [23529] 100000 23529      395       31       6       3        0             0 startcontainer.
[Thu Dec 13 00:21:30 2018] [23562] 100000 23562   481477   129889     353       6        0             0 java
[Thu Dec 13 00:21:30 2018] [23658] 100000 23658    41674      945      49       3        0             0 virsh
[Thu Dec 13 00:21:30 2018] [23659] 100000 23659      393        1       5       3        0             0 sh
[Thu Dec 13 00:21:30 2018] Memory cgroup out of memory: Kill process 23562 (java) score 993 or sacrifice child
[Thu Dec 13 00:21:30 2018] Killed process 23562 (java) total-vm:1925908kB, anon-rss:519556kB, file-rss:0kB

It looks as though the problem occurs during a JIT compilation at startup.

Here are the options we're using on startup:

-XX:+UseContainerSupport
-XX:MaxRAMPercentage=30
-Xss384k
-Xscmx24m

@huntc

huntc commented Dec 14, 2018

Rather than continue describing my issue here, I've now opened up a new one so that it can be discussed in a stand-alone manner: #4050
