AIX OOMs in extended.system SharedClasses MultiThread testing #8842
@jdekonin could the machines have changed? This doesn't seem to have been caused by a JVM or test change. |
Ah, nm, I missed a change, this is caused by ibmruntimes/openj9-openjdk-jdk11#272 |
no, not aware of that. especially, it doesn't make sense for 64bit. let me post a question to AIX folks. |
i assumed errno==4 was being returned by the pthread_create call, according to the message above? |
Why 4 (INTR)? pthread_create is returning 11 (EAGAIN), which is "The limit on the number of threads per process has been reached.". We did see a similar problem creating threads because of the data limit, but we modified the VM to set the data soft limit to the hard limit. |
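The approach mentioned above, raising the data soft limit to the hard limit, can be illustrated with a minimal sketch. This is not the VM's actual code (which is native); it only shows the same idea through Python's standard `resource` module, and the helper name `raise_soft_limit_to_hard` is made up for illustration.

```python
import resource


def raise_soft_limit_to_hard(which=resource.RLIMIT_DATA):
    """Raise the soft limit of one resource to its hard limit.

    Hypothetical helper: an unprivileged process may raise its soft
    limit up to, but not beyond, the hard limit.
    """
    soft, hard = resource.getrlimit(which)
    if soft != hard:
        # Keep the hard limit unchanged; only the soft limit moves.
        resource.setrlimit(which, (hard, hard))
    return resource.getrlimit(which)


soft, hard = raise_soft_limit_to_hard()
print("data limit (soft, hard):", soft, hard)
```

The same helper could be pointed at another limit, for example `resource.RLIMIT_RSS` on platforms where Python exposes it, to inspect the limit that `ulimit -m` controls.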
i was looking at "J9THREAD_ERR_OS_ERRNO_SET 0x40000000" ... which seemed to imply 4. |
cryptic error message |
AIX kernel folks responded: I would say: short of a ulimit issue, it is unimaginable that malloc failed for 64bit. if we can narrow the good & bad drivers down to those without/with that PR, then we can experiment further with settings: |
by the way, default is 32 heaps (max as well) without :n. |
I do see buckets option later on. Let's play with/without it as well ... for example: |
https://ci.eclipse.org/openj9/view/Test/job/Grinder/666 - multiheap,buckets - failed
https://ci.eclipse.org/openj9/view/Test/job/Grinder/661 - multiheap:4 - passed |
16 and 16,buckets passed too. could you launch explicit 32 and 32,buckets build/tests? |
for pinning down the symptom, maybe it is good to launch 24,buckets ... 28,buckets, and 29/30/31,buckets. Let's see when it cracks. |
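The settings being tried in this thread all follow the same pattern: `multiheap`, optionally with `:n` heaps, optionally followed by `,buckets`. A small sketch of how such a test matrix could be generated is shown here; the helper names `malloc_options` and `env_for` are hypothetical, and the upper bound of 32 heaps is taken from the comment above about the default and maximum.

```python
import os


def malloc_options(heaps=None, buckets=False):
    """Build a MALLOCOPTIONS value such as 'multiheap:24,buckets'.

    heaps=None leaves out ':n', which per the thread means the
    default of 32 heaps.
    """
    if heaps is not None and not 1 <= heaps <= 32:
        raise ValueError("heaps must be between 1 and 32")
    parts = ["multiheap" if heaps is None else f"multiheap:{heaps}"]
    if buckets:
        parts.append("buckets")
    return ",".join(parts)


def env_for(setting):
    """Copy the current environment and set MALLOCOPTIONS in the copy."""
    env = dict(os.environ)
    env["MALLOCOPTIONS"] = setting
    return env


# The heap counts suggested above for pinning down the failure point.
matrix = [malloc_options(n, buckets=True) for n in (24, 28, 29, 30, 31)]
for setting in matrix:
    print(setting)
```

Each environment returned by `env_for` could then be passed as `env=` to `subprocess.run` when launching the test JVM; the exact test command is not given in the thread, so it is left out here.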
https://ci.eclipse.org/openj9/view/Test/job/Grinder/671 - multiheap:32 - passed
I'm only running one iteration of each test. Now 24,buckets has failed, but 28,buckets passed. |
that seems to indicate EAGAIN can happen unpredictably, depending on timing, for the pthread_create call. assuming the same number of threads is created in each run, that leaves only one possibility: the virtual or data memory ulimit is exceeded unpredictably with massive concurrent malloc/free going on. it could be exceeded only momentarily though. 32,buckets failed around 14700 threads, while 24,buckets failed around 18000 threads. 32,buckets has more fragmentation, leading to the limit being exceeded momentarily. it makes sense to me. the ",buckets" option makes it doubly fragmented with many concurrent malloc/free. could you find out the ulimit for virtual memory and data memory for these runs? once we know the reason, we can decide on a fix. |
https://ci.eclipse.org/openj9/view/Test/job/Grinder/666 (effectively 32,buckets) failed around 17400 threads, while the later 32,buckets run failed around 14700 threads. indeed, it is unpredictable. my theory seems to hold ... |
https://ci.eclipse.org/openj9/view/Test/job/Grinder/676 - multiheap:16,buckets x 3 - passed |
if all are unlimited, we might need to bring it forward to AIX kernel team. Logically at least. Depending on how to interpret EAGAIN. |
From https://ci.eclipse.org/openj9/view/Test/job/Grinder/674 - multiheap:24,buckets javacore
|
it is likely RSS being exceeded momentarily. didn't see "virtual memory" (-v) though. |
make RSS unlimited and retry ... the eventual fix could be to go without ,buckets. we need to confirm the performance benefit of ",buckets" with data from the security test cases. |
Is there a particular setting I should try with RSS unlimited?
So I'll try a run with |
ulimit -m unlimited (for RSS unlimited) yes, we can try a few multiheap alone runs. |
yes, but with which testcase, |
yes, with any of the previously failing setting(s), e.g. multiheap,buckets |
With ulimit -m unlimited:
https://ci.eclipse.org/openj9/view/Test/job/Grinder/683 - multiheap:24,buckets x 3 - passed
Without changing ulimit: |
Updated results in the previous comment. |
@zl-wang is someone looking at the performance of using |
@chao.shan@ibm.com could you compare on AIX crypto performance test cases between MALLOCOPTIONS=multiheap vs. MALLOCOPTIONS=multiheap,buckets? |
jdk11
|
nice. now, @shanchao95 performance comparison is the critical factor for fix decision. |
jdk 11
|
@zl-wang Sorry I didn't see this issue since my external handle is wrong there. @sophiaxu0424 Can we please get the data requested in #8842 (comment)? As Peter recommended in #8842 (comment), we should use the 0.18.1 release build. Thanks! |
@zl-wang Sure, here is performance for |
@shanchao95 thanks for the data. but I am confused: previously, between the multiheap setting and no-setting, for 16-thread tests on a small 512 payload, the performance improvement ranged from a few times to hundreds of times. however, this batch of data didn't show that at all. I am wondering if your driver is from after the code merge, such that your no-setting actually means the current setting in the merge. |
@zl-wang I found a different version of jdk was used for this performance test. Previously i used jvm**0317 from espresso, but this is espresso 0327. A new run with jdk0317 is running. It will be updated once it is done. |
@zl-wang performance updated |
@shanchao95 thanks a lot for the data. summary:
what is the converged fix? that is a good question. I am inclined toward multiheap,considersize for moderate RSS plus most of the performance benefit. Note somewhere: the multiheap,buckets setting may give another 100% to 40% performance boost for certain applications. |
@zl-wang we need to finalize the setting to be used for the 0.20.0 release this week. For the time being, I'm going to change the head stream to "multiheap,considersize" to avoid failing the tests every night. It sounds like you are considering a perf/footprint comparison between |
The previous setting `MALLOCOPTIONS=multiheap` uses too much memory in some cases. Issue eclipse-openj9/openj9#8842 Signed-off-by: Peter Shipton <Peter_Shipton@ca.ibm.com>
I am fine with the proposed |
@pshipton sorry for my ambiguity. the additional buckets option had only a marginal performance advantage. i was referring to the extreme case of tuning, for a performance guide or something. i was proposing multiheap,considersize as the fix. @vijaysun-omr JCE performance data I summarized here: #8842 (comment) |
Startup/footprint runs have finished. Please see: javanext/issues/176#issuecomment-19050257
Setting-2 keeps failing the jobs and might need some help to figure it out @zl-wang @piyush286 FYI. |
The change to use "multiheap,considersize" is merged to the head stream, but not yet to the 0.20.0 branches |
The changes are merged to the 0.20.0 branches. Closing this issue; a new issue can be opened for any further improvements. |
"java/lang/OutOfMemoryError" "Failed to create a thread: retVal -1073741830, errno 11"
11 EAGAIN The limit on the number of threads per process has been reached.
-1073741830 == 0xBFFFFFFA, negated is 0x40000006
J9THREAD_ERR_THREAD_CREATE_FAILED 6
J9THREAD_ERR_OS_ERRNO_SET 0x40000000
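The arithmetic above can be checked with a short sketch. It assumes, as the lines above suggest, that retVal is the negated error code with the J9THREAD_ERR_OS_ERRNO_SET flag OR'ed in; the function name `decode_retval` is made up for illustration and is not part of the VM.

```python
# Constants as quoted in the issue description.
J9THREAD_ERR_OS_ERRNO_SET = 0x40000000   # flag: an OS errno was recorded
J9THREAD_ERR_THREAD_CREATE_FAILED = 6


def decode_retval(ret_val):
    """Split a negative retVal into (error code, errno-set flag).

    Assumption: retVal == -(code | J9THREAD_ERR_OS_ERRNO_SET).
    """
    magnitude = -ret_val
    errno_set = bool(magnitude & J9THREAD_ERR_OS_ERRNO_SET)
    code = magnitude & ~J9THREAD_ERR_OS_ERRNO_SET
    return code, errno_set


# 32-bit two's-complement view of the value from the error message.
print(hex(-1073741830 & 0xFFFFFFFF))   # 0xbffffffa
print(hex(1073741830))                 # 0x40000006

code, errno_set = decode_retval(-1073741830)
print(code == J9THREAD_ERR_THREAD_CREATE_FAILED, errno_set)  # True True
```

So the flag bit only says that errno carries the OS error; the errno itself is the separate value 11 (EAGAIN) in the message, not 4.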
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_ppc64_aix_Nightly/311/
aix71-p8-4
SharedClasses.SCM01.MultiThread_0
SharedClasses.SCM23.MultiThread_0
https://ci.eclipse.org/openj9/job/Test_openjdk14_j9_extended.system_ppc64_aix_Nightly/14/
aix71-p8-1
SharedClasses.SCM01.MultiThread_0
SharedClasses.SCM23.MultiThread_0
SharedClasses.SCM23.MultiThreadMultiCL_0
https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_extended.system_ppc64_aix_Nightly/313/
SharedClasses.SCM01.MultiThread_0
SharedClasses.SCM23.MultiThread_0
Changes from previous passing build (from jdk11).
a1ed808...c209fa5
eclipse-openj9/openj9-omr@b03105e...79f6485
ibmruntimes/openj9-openjdk-jdk11@b1d6957...7d1badb
adoptium/aqa-tests@8737892...e459da2