
AIX OOMs in extended.system SharedClasses MultiThread testing #8842

Closed
pshipton opened this issue Mar 12, 2020 · 96 comments

Comments

@pshipton
Member

pshipton commented Mar 12, 2020

"java/lang/OutOfMemoryError" "Failed to create a thread: retVal -1073741830, errno 11"
11 EAGAIN The limit on the number of threads per process has been reached.
-1073741830 == 0xBFFFFFFA, negated is 0x40000006
J9THREAD_ERR_THREAD_CREATE_FAILED 6
J9THREAD_ERR_OS_ERRNO_SET 0x40000000
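
For reference, here is a minimal decoding sketch (using only the constants quoted above; an illustration, not the VM's actual code path):

```c
#include <stdio.h>

/* Constants as quoted above from the OpenJ9 thread library. */
#define J9THREAD_ERR_THREAD_CREATE_FAILED 6
#define J9THREAD_ERR_OS_ERRNO_SET 0x40000000

int main(void) {
    int retVal = -1073741830;                    /* from the OOM message */
    unsigned int bits = (unsigned int)(-retVal); /* 0x40000006 */

    if (bits & J9THREAD_ERR_OS_ERRNO_SET) {
        /* The flag says the OS errno (11 == EAGAIN here) was captured. */
        printf("thread library error code: %u\n",
               bits & ~J9THREAD_ERR_OS_ERRNO_SET); /* prints 6 */
    }
    return 0;
}
```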

https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_ppc64_aix_Nightly/311/
aix71-p8-4
SharedClasses.SCM01.MultiThread_0
SharedClasses.SCM23.MultiThread_0

https://ci.eclipse.org/openj9/job/Test_openjdk14_j9_extended.system_ppc64_aix_Nightly/14/
aix71-p8-1
SharedClasses.SCM01.MultiThread_0
SharedClasses.SCM23.MultiThread_0
SharedClasses.SCM23.MultiThreadMultiCL_0

https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_extended.system_ppc64_aix_Nightly/313/
SharedClasses.SCM01.MultiThread_0
SharedClasses.SCM23.MultiThread_0

Changes from the previous passing build (from jdk11):

a1ed808...c209fa5
eclipse-openj9/openj9-omr@b03105e...79f6485
ibmruntimes/openj9-openjdk-jdk11@b1d6957...7d1badb

adoptium/aqa-tests@8737892...e459da2

@pshipton
Member Author

@jdekonin could the machines have changed? This doesn't seem to have been caused by a JVM or test change.

@pshipton
Member Author

Ah, never mind, I missed a change; this is caused by ibmruntimes/openj9-openjdk-jdk11#272
@zl-wang @shanchao95
Zulian, this seems problematic. Is it expected that "MALLOCOPTIONS=multiheap,buckets" would reduce the number of threads that can be created?

@zl-wang
Contributor

zl-wang commented Mar 12, 2020

No, I'm not aware of that. In particular, it doesn't make sense for 64-bit. Let me post a question to the AIX folks.

@zl-wang
Contributor

zl-wang commented Mar 12, 2020

I assumed errno==4 was being returned by the pthread_create call, according to the message above?

@pshipton
Member Author

pshipton commented Mar 12, 2020

Why 4 (EINTR)? pthread_create is returning 11 (EAGAIN), which is "The limit on the number of threads per process has been reached."

We did see a similar problem creating threads because of the data limit, but we modified the VM to set the data soft limit to the hard limit:
#8367
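
For context, the #8367 fix amounted to raising the soft data limit to the hard limit at startup. A minimal sketch of that idea using the POSIX rlimit API (illustrative only, not the VM's actual code):

```c
#include <stdio.h>
#include <sys/resource.h>

/* Raise the data-segment soft limit to the hard limit, in the spirit
 * of the #8367 fix mentioned above. */
int main(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_DATA, &rl) == 0 && rl.rlim_cur < rl.rlim_max) {
        rl.rlim_cur = rl.rlim_max;
        if (setrlimit(RLIMIT_DATA, &rl) != 0) {
            perror("setrlimit(RLIMIT_DATA)");
        }
    }
    return 0;
}
```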

@zl-wang
Contributor

zl-wang commented Mar 12, 2020

I was looking at "J9THREAD_ERR_OS_ERRNO_SET 0x40000000" ... the leading 4 there seemed to imply it.

@zl-wang
Contributor

zl-wang commented Mar 12, 2020

cryptic error message

@zl-wang
Contributor

zl-wang commented Mar 12, 2020

AIX kernel folks responded:
pthread_create returns EAGAIN either when the thread-count ulimit is exceeded or when allocating (malloc'ing) memory for internal pthread data structures fails. (A minimal probe of this behaviour is sketched after the settings list below.)

I would say:
Let's dig a little deeper into the javacore file: what is the ulimit? How many threads are currently active?

Short of a ulimit issue, it is hard to imagine malloc failing on 64-bit.

If we can confirm the good and bad drivers differ only by being without/with that PR, then we can experiment further with these settings:
MALLOCOPTIONS=multiheap:4
MALLOCOPTIONS=multiheap:8
MALLOCOPTIONS=multiheap:16
(dropping ,buckets, which I didn't see in the manual anyway)
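
As an aside, the EAGAIN behaviour described above can be probed with a small standalone program; this is a hedged sketch for experimentation, not the test's actual code:

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *idle_thread(void *arg) {
    (void)arg;
    pause(); /* keep the thread alive so it counts against the limit */
    return NULL;
}

int main(void) {
    pthread_t t;
    long count = 0;
    int rc;

    while ((rc = pthread_create(&t, NULL, idle_thread, NULL)) == 0) {
        count++;
    }
    /* Per the AIX kernel folks, rc == EAGAIN here can mean either the
     * per-process thread limit was hit or an internal malloc failed. */
    printf("pthread_create failed after %ld threads: rc=%d (EAGAIN=%d)\n",
           count, rc, EAGAIN);
    return 0;
}
```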

@zl-wang
Contributor

zl-wang commented Mar 12, 2020

By the way, the default is 32 heaps (which is also the maximum) when :n is not specified.

@zl-wang
Contributor

zl-wang commented Mar 12, 2020

I do see the buckets option later on. Let's test with/without it as well ... for example:
MALLOCOPTIONS=multiheap:4,buckets

@pshipton
Member Author

pshipton commented Mar 13, 2020

@zl-wang
Contributor

zl-wang commented Mar 13, 2020

16 and 16,buckets passed too. Could you launch explicit 32 and 32,buckets builds/tests?

@zl-wang
Contributor

zl-wang commented Mar 13, 2020

To pin down the symptom, maybe it is good to launch 24,buckets ... 28,buckets, and 29/30/31,buckets. Let's see when it cracks.

@pshipton
Member Author

https://ci.eclipse.org/openj9/view/Test/job/Grinder/671 - multiheap:32 - passed
https://ci.eclipse.org/openj9/view/Test/job/Grinder/672 - multiheap:32,buckets - failed
https://ci.eclipse.org/openj9/view/Test/job/Grinder/673 - multiheap:28,buckets - passed
https://ci.eclipse.org/openj9/view/Test/job/Grinder/674 - multiheap:24,buckets - failed

I'm only running one iteration of each test. Now 24,buckets has failed, but 28,buckets passed.

@zl-wang
Contributor

zl-wang commented Mar 13, 2020

That seems to indicate EAGAIN can happen unpredictably for the pthread_create call, depending on timing. Assuming the same number of threads is created in each run, that leaves only one possibility: the virtual or data memory ulimit is exceeded unpredictably while massive concurrent malloc/free activity is going on. It could be exceeded only momentarily, though.

32,buckets failed around 14700 threads, while 24,buckets failed around 18000 threads. 32,buckets has more fragmentation, leading to the momentary exceeding; that makes sense to me. The ",buckets" option makes it doubly fragmented under many concurrent malloc/free calls.

Could you find out the ulimits for virtual memory and data memory for these runs (a quick way to dump them is sketched below)? Once we know the reason, we can decide on a fix.
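
One quick way to dump the limits in question from a native process — an illustrative sketch using the POSIX getrlimit API (RLIMIT_RSS is a platform extension, but AIX provides it, as the javacore output later in this thread shows):

```c
#include <stdio.h>
#include <sys/resource.h>

static void show(const char *name, int resource) {
    struct rlimit rl;
    if (getrlimit(resource, &rl) == 0) {
        printf("%-12s soft=%llu hard=%llu\n", name,
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
    }
}

int main(void) {
    show("RLIMIT_AS", RLIMIT_AS);     /* virtual memory */
    show("RLIMIT_DATA", RLIMIT_DATA); /* data segment */
    show("RLIMIT_RSS", RLIMIT_RSS);   /* resident set size (AIX extension) */
    return 0;
}
```

(RLIM_INFINITY values print as very large numbers in this sketch.)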

@zl-wang
Contributor

zl-wang commented Mar 13, 2020

https://ci.eclipse.org/openj9/view/Test/job/Grinder/666 (effectively 32,buckets) failed around 17400 threads, while the later 32,buckets run failed around 14700 threads. Indeed, it is unpredictable. My theory seems to stand ...

@pshipton
Member Author

https://ci.eclipse.org/openj9/view/Test/job/Grinder/676 - multiheap:16,buckets x 3 - passed
https://ci.eclipse.org/openj9/view/Test/job/Grinder/677 - multiheap:20,buckets x 3 - passed
https://ci.eclipse.org/openj9/view/Test/job/Grinder/678 - multiheap:22,buckets x 3 - passed

@zl-wang
Contributor

zl-wang commented Mar 13, 2020

If all the limits are unlimited, we might need to bring this forward to the AIX kernel team, logically at least, depending on how EAGAIN should be interpreted.

@pshipton
Member Author

From https://ci.eclipse.org/openj9/view/Test/job/Grinder/674 - multiheap:24,buckets javacore

1CIUSERLIMITS  User Limits (in bytes except for NOFILE and NPROC)
NULL           ------------------------------------------------------------------------
NULL           type                            soft limit           hard limit
2CIUSERLIMIT   RLIMIT_AS                        unlimited            unlimited
2CIUSERLIMIT   RLIMIT_CORE                      unlimited            unlimited
2CIUSERLIMIT   RLIMIT_CPU                       unlimited            unlimited
2CIUSERLIMIT   RLIMIT_DATA                      unlimited            unlimited
2CIUSERLIMIT   RLIMIT_FSIZE                     unlimited            unlimited
2CIUSERLIMIT   RLIMIT_NOFILE                    unlimited            unlimited
2CIUSERLIMIT   RLIMIT_THREADS                   unlimited            unlimited
2CIUSERLIMIT   RLIMIT_NPROC                     unlimited            unlimited
2CIUSERLIMIT   RLIMIT_RSS                        33554432            unlimited
2CIUSERLIMIT   RLIMIT_STACK                    4294967296           4294967296

@zl-wang
Contributor

zl-wang commented Mar 13, 2020

It is likely the RSS limit is being exceeded momentarily (the RLIMIT_RSS soft limit above, 33554432 bytes, is only 32 MB). I didn't see "virtual memory" (-v) in the list, though.

@zl-wang
Contributor

zl-wang commented Mar 13, 2020

Make RSS unlimited and retry ...

The eventual fix could be to go without ,buckets. We need to confirm against the performance benefit/data of ",buckets" from the security test cases.

@pshipton
Member Author

> Make RSS unlimited and retry ...

Is there a particular setting I should try with RSS unlimited?

> The eventual fix could be to go without ,buckets.

So I'll try a run with MALLOCOPTIONS=multiheap then?

@zl-wang
Contributor

zl-wang commented Mar 13, 2020

ulimit -m unlimited (for RSS unlimited)

Yes, we can try a few multiheap-alone runs.

@pshipton
Member Author

> ulimit -m unlimited (for RSS unlimited)

Yes, but with which testcase, multiheap:24,buckets?

@zl-wang
Contributor

zl-wang commented Mar 13, 2020

> Yes, but with which testcase, multiheap:24,buckets?

Yes, with any of the previously failing settings, e.g. multiheap,buckets.

@pshipton
Member Author

pshipton commented Mar 13, 2020

With ulimit -m unlimited:

https://ci.eclipse.org/openj9/view/Test/job/Grinder/683 - multiheap:24,buckets x 3 - passed
https://ci.eclipse.org/openj9/view/Test/job/Grinder/684 - multiheap:32,buckets - failed
https://ci.eclipse.org/openj9/view/Test/job/Grinder/685 - multiheap,buckets - failed

Without changing ulimit:
https://ci.eclipse.org/openj9/view/Test/job/Grinder/686 - multiheap x 3 - passed

@pshipton
Member Author

Updated results in the previous comment.

@pshipton
Member Author

@zl-wang is someone looking at the performance of using MALLOCOPTIONS=multiheap?

@zl-wang
Contributor

zl-wang commented Mar 17, 2020

@chao.shan@ibm.com could you compare the AIX crypto performance test cases between MALLOCOPTIONS=multiheap and MALLOCOPTIONS=multiheap,buckets?

@pshipton
Member Author

jdk11 MALLOCOPTIONS=multiheap,considersize

11:22:03  MT3 stdout of command: ps v 6226366
11:22:03  MT3      PID    TTY STAT  TIME PGIN  SIZE   RSS   LIM  TSIZ   TRS %CPU %MEM COMMAND
11:22:03  MT3  6226366      - A     4:10 4607 157172 157300 32768    75   128  4.9  2.0 /home/u
11:22:03  MT3 stderr of command: ps v 6226366
11:22:03  MT1 stdout of command: ps v 7799084
11:22:03  MT5 stdout of command: ps v 7405844
11:22:03  MT5      PID    TTY STAT  TIME PGIN  SIZE   RSS   LIM  TSIZ   TRS %CPU %MEM COMMAND
11:22:03  MT5  7405844      - A     4:17  772 166196 166324 32768    75   128  5.1  2.0 /home/u
11:22:03  MT1      PID    TTY STAT  TIME PGIN  SIZE   RSS   LIM  TSIZ   TRS %CPU %MEM COMMAND
11:22:03  MT1  7799084      - A     4:04 2714 158452 158580 32768    75   128  4.8  2.0 /home/u
11:22:03  MT1 stderr of command: ps v 7799084
11:22:03  MT5 stderr of command: ps v 7405844
11:22:06  MT2 stdout of command: ps v 4718840
11:22:06  MT2      PID    TTY STAT  TIME PGIN  SIZE   RSS   LIM  TSIZ   TRS %CPU %MEM COMMAND
11:22:06  MT2  4718840      - A     4:15  719 164980 165108 32768    75   128  4.9  2.0 /home/u
11:22:06  MT2 stderr of command: ps v 4718840
11:22:09  MT4 stdout of command: ps v 8519920
11:22:09  MT4      PID    TTY STAT  TIME PGIN  SIZE   RSS   LIM  TSIZ   TRS %CPU %MEM COMMAND
11:22:09  MT4  8519920      - A     4:04  723 157684 157812 32768    75   128  4.7  2.0 /home/u
11:22:09  MT4 stderr of command: ps v 8519920

@zl-wang
Contributor

zl-wang commented Mar 30, 2020

Nice. Now @shanchao95's performance comparison is the critical factor for the fix decision.

@pshipton
Member Author

jdk 11 MALLOCOPTIONS=multiheap:18

12:01:47  MT5 stdout of command: ps v 11927744
12:01:47  MT5       PID    TTY STAT  TIME PGIN  SIZE   RSS   LIM  TSIZ   TRS %CPU %MEM COMMAND
12:01:47  MT5  11927744      - A     4:10 1501 685620 685748 65536    75   128  5.0  8.0 /home/
12:01:47  MT5 stderr of command: ps v 11927744
12:01:47  MT1 stdout of command: ps v 16711774
12:01:47  MT1       PID    TTY STAT  TIME PGIN  SIZE   RSS   LIM  TSIZ   TRS %CPU %MEM COMMAND
12:01:47  MT1  16711774      - A     4:05 6352 705204 705332 65536    75   128  4.8  9.0 /home/
12:01:47  MT1 stderr of command: ps v 16711774
12:01:51  MT2 stdout of command: ps v 16253166
12:01:51  MT2       PID    TTY STAT  TIME PGIN  SIZE   RSS   LIM  TSIZ   TRS %CPU %MEM COMMAND
12:01:51  MT2  16253166      - A     4:05  840 677428 677556 65536    75   128  4.8  8.0 /home/
12:01:51  MT2 stderr of command: ps v 16253166
12:01:54  MT3 stdout of command: ps v 16056322
12:01:56  MT3       PID    TTY STAT  TIME PGIN  SIZE   RSS   LIM  TSIZ   TRS %CPU %MEM COMMAND
12:01:56  MT3  16056322      - A     4:15  773 590900 591028 65536    75   128  4.9  7.0 /home/
12:01:56  MT3 stderr of command: ps v 16056322
12:01:57  MT4 stdout of command: ps v 14090452
12:01:57  MT4       PID    TTY STAT  TIME PGIN  SIZE   RSS   LIM  TSIZ   TRS %CPU %MEM COMMAND
12:01:57  MT4  14090452      - A     4:20 1134 588340 588468 65536    75   128  5.0  7.0 /home/
12:01:57  MT4 stderr of command: ps v 14090452

@piyush286

@zl-wang Sorry, I didn't see this issue earlier since my external handle is wrong there.

@sophiaxu0424 Can we please get the data requested in #8842 (comment)? As Peter recommended in #8842 (comment), we should use the 0.18.1 release build. Thanks!

@shanchao95

@zl-wang Sure, here is the performance data for MALLOCOPTIONS=multiheap,considersize, MALLOCOPTIONS=multiheap:18, and no MALLOCOPTIONS setting.
performance.xlsx

@zl-wang
Contributor

zl-wang commented Apr 1, 2020

@shanchao95 thanks for the data, but I am confused: previously, between the multiheap setting and no setting, the performance improvement for 16-thread tests on the small 512 payload ranged from a few times to hundreds of times. However, this batch of data didn't show that at all.

I am wondering whether your driver was built after the code merge, such that your no-setting run actually used the current setting from the merge.

@shanchao95

shanchao95 commented Apr 2, 2020

@zl-wang I found a different version of the JDK was used for this performance test. Previously I used jvm**0317 from espresso, but this was espresso 0327. A new run with jdk0317 is running; the results will be updated once it is done.

@shanchao95

@zl-wang performance updated
performance.xlsx

@zl-wang
Contributor

zl-wang commented Apr 6, 2020

@shanchao95 thanks a lot for the data. Summary:

  1. multiheap,considersize gives 50% to 70% of the performance of multiheap, which in turn is 10x to 100x the performance of no setting, for the multi-threaded cases;
  2. multiheap,considersize memory consumption/RSS is in the same ballpark as no setting. However, multiheap RSS has two cases:
    2a) intensive multi-threaded malloc/free: more than 5x the no-setting RSS (e.g. this PR's test case);
    2b) sporadic malloc/free: no impact ... in the same ballpark as no setting (e.g. the DT7 test).

What is the converged fix? That is a good question. I'm inclined toward multiheap,considersize for its moderate RSS plus most of the performance benefit. It should be noted somewhere that the multiheap,buckets setting may give a further 40% to 100% performance boost for certain applications.
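
To make the mechanics concrete: AIX malloc consults MALLOCOPTIONS when the allocator initializes, so the variable must be in the environment before the JVM process starts. A purely hypothetical wrapper illustrating that (the real change lives in the launcher commits referenced below; the "java" target here is an assumption):

```c
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical wrapper: export MALLOCOPTIONS before exec'ing the target,
 * since the AIX allocator reads it only at process initialization. */
int main(int argc, char **argv) {
    (void)argc;
    setenv("MALLOCOPTIONS", "multiheap,considersize", 1);
    execvp("java", argv); /* "java" is an illustrative target, not the actual fix */
    return 1;             /* reached only if exec fails */
}
```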

@pshipton
Member Author

pshipton commented Apr 6, 2020

@zl-wang we need to finalize the setting to be used for the 0.20.0 release this week. For the time being, I'm going to change the head stream to "multiheap,considersize" to avoid failing the tests every night.

It sounds like you are considering a perf/footprint comparison between multiheap,considersize and multiheap,buckets,considersize. Can you make an explicit request?

pshipton added a commit to pshipton/openj9-openjdk-jdk8 that referenced this issue Apr 6, 2020
The previous setting `MALLOCOPTIONS=multiheap` uses too much memory in
some cases.

Issue eclipse-openj9/openj9#8842

Signed-off-by: Peter Shipton <Peter_Shipton@ca.ibm.com>
pshipton added a commit to pshipton/openj9-openjdk-jdk11 that referenced this issue Apr 6, 2020
The previous setting `MALLOCOPTIONS=multiheap` uses too much memory in
some cases.

Issue eclipse-openj9/openj9#8842

Signed-off-by: Peter Shipton <Peter_Shipton@ca.ibm.com>
pshipton added a commit to pshipton/openj9-openjdk-jdk14 that referenced this issue Apr 6, 2020
The previous setting `MALLOCOPTIONS=multiheap` uses too much memory in
some cases.

Issue eclipse-openj9/openj9#8842

Signed-off-by: Peter Shipton <Peter_Shipton@ca.ibm.com>
pshipton added a commit to pshipton/openj9-openjdk-jdk that referenced this issue Apr 6, 2020
The previous setting `MALLOCOPTIONS=multiheap` uses too much memory in
some cases.

Issue eclipse-openj9/openj9#8842

Signed-off-by: Peter Shipton <Peter_Shipton@ca.ibm.com>
@vijaysun-omr
Contributor

I am fine with the proposed considersize setting, but should we get the JCE performance team to also measure with this setting on AIX, to get the performance picture for the multithreaded scenarios they were running?

@zl-wang
Contributor

zl-wang commented Apr 6, 2020

@pshipton sorry for my ambiguity. The additional buckets option only had a marginal performance advantage; I was referring to the extreme tuning case for a performance guide or something. I was proposing multiheap,considersize as the fix.

@vijaysun-omr the JCE performance data is summarized here: #8842 (comment)

@sophiaxu0424

sophiaxu0424 commented Apr 6, 2020

Startup/footprint runs have finished. Please see: javanext/issues/176#issuecomment-19050257
1 - no setting
2 - export MALLOCOPTIONS=multiheap
3 - export MALLOCOPTIONS=multiheap,considersize

Setting                                  Startup time (ms)   Footprint (kb)
1 No setting              mean           12044.2292          436334.75
                          CI             2.57%               0.54%
3 multiheap,considersize  mean           11996.8125          436044.333
                          CI             2.17%               0.56%
                          ratio          99.61%              99.93%
(ratio = withSetting/noSetting)

Setting 2 keeps failing the jobs and might need some help to figure out. @zl-wang @piyush286 FYI.

pshipton added a commit to pshipton/openj9-openjdk-jdk8 that referenced this issue Apr 7, 2020
The previous setting `MALLOCOPTIONS=multiheap` uses too much memory in
some cases.

Issue eclipse-openj9/openj9#8842

Signed-off-by: Peter Shipton <Peter_Shipton@ca.ibm.com>
pshipton added a commit to pshipton/openj9-openjdk-jdk11 that referenced this issue Apr 7, 2020
The previous setting `MALLOCOPTIONS=multiheap` uses too much memory in
some cases.

Issue eclipse-openj9/openj9#8842

Signed-off-by: Peter Shipton <Peter_Shipton@ca.ibm.com>
pshipton added a commit to pshipton/openj9-openjdk-jdk14 that referenced this issue Apr 7, 2020
The previous setting `MALLOCOPTIONS=multiheap` uses too much memory in
some cases.

Issue eclipse-openj9/openj9#8842

Signed-off-by: Peter Shipton <Peter_Shipton@ca.ibm.com>
@pshipton
Member Author

pshipton commented Apr 7, 2020

The change to use "multiheap,considersize" is merged to the head stream, but not yet to the 0.20.0 branches.

@pshipton
Member Author

pshipton commented Apr 8, 2020

The changes are merged to the 0.20.0 branches. Closing this issue; a new issue can be opened for any further improvements.

@pshipton pshipton closed this as completed Apr 8, 2020