AArch64: Failure in SharedClasses.SCM01.MultiThreadMultiCL_0 #8987

Closed
knn-k opened this issue Mar 26, 2020 · 11 comments
Labels
arch:aarch64 segfault Issues that describe segfaults / JVM crashes test failure

Comments

knn-k commented Mar 26, 2020

Failure link
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_aarch64_linux_Personal/4/

SharedClasses.SCM01.MultiThreadMultiCL_0 crashes in libj9jit29.so intermittently.

MTM1 stderr Unhandled exception
MTM1 stderr Type=Segmentation error vmState=0x00000000
MTM1 stderr J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
MTM1 stderr Handler1=0000FFFF8FCBD2C0 Handler2=0000FFFF8FB79C8C InaccessibleAddress=00000000D65F03C8
MTM1 stderr R0=0000000000000000 R1=0000000000002EE7 R2=D65F03C0F9400C00 R3=0000FFFF8F6E2AD0
MTM1 stderr R4=0000000000000041 R5=0000000000A9ACB0 R6=0000FFFF8C505D00 R7=0000FFFF8EF7B8F8
MTM1 stderr R8=00000000000086C5 R9=0000FFFF53516E7C R10=0000000000B60D08 R11=0000000000000001
MTM1 stderr R12=0000000000000000 R13=2F676E616C2F6176 R14=003690B2EEA54B28 R15=0000FFFF8F4C71E4
MTM1 stderr R16=0000FFFF8FC30328 R17=0000FFFF90B3AE1C R18=0000000000000001 R19=00000000D65F03C0
MTM1 stderr R20=0000000000A9ACB0 R21=000000000000051E R22=0000FFFF8C4C25D0 R23=0000000000000001
MTM1 stderr R24=0000000000A9ACD0 R25=0000000000000041 R26=0000000000000014 R27=0000000000000001
MTM1 stderr R28=0000000000000000 R29=0000FFFDE70FDE20 R30=0000FFFF8EF8352C R31=0000FFFDE70FDCC0
MTM1 stderr PC=0000FFFF8EF7CF68 SP=0000FFFDE70FDCC0 PSTATE=0000000000000000
(FP register lines omitted)
MTM1 stderr Module=/home/jenkins/workspace/Test_openjdk11_j9_extended.system_aarch64_linux_Personal/openjdkbinary/j2sdk-image/lib/compressedrefs/libj9jit29.so
MTM1 stderr Module_base_address=0000FFFF8ED90000
MTM1 stderr Target=2_90_20200324_45 (Linux 4.14.0-115.2.2.el7a.aarch64)
MTM1 stderr CPU=aarch64 (96 logical CPUs) (0x1fcd7e0000 RAM)
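
When reading a dump like the one above, it can help to check which general-purpose registers hold a value just below InaccessibleAddress, since that hints at the base pointer that was dereferenced. The following is a minimal sketch, not part of OpenJ9: the function name `registers_near_fault` and the 64-byte window are assumptions chosen for illustration, and the sample text is a subset of the lines above.

```python
import re

def registers_near_fault(dump_text, window=64):
    """Return {register: offset} for registers whose value lies within
    `window` bytes at or below InaccessibleAddress in a J9 crash dump."""
    fault_match = re.search(r"InaccessibleAddress=([0-9A-Fa-f]+)", dump_text)
    if fault_match is None:
        return {}
    fault = int(fault_match.group(1), 16)
    near = {}
    # General-purpose registers are printed as R<n>=<16 hex digits>.
    for name, value in re.findall(r"\b(R\d+)=([0-9A-Fa-f]{16})\b", dump_text):
        offset = fault - int(value, 16)
        if 0 <= offset <= window:
            near[name] = offset
    return near

# Subset of the dump above (hypothetical helper input, real values).
sample = """
Handler1=0000FFFF8FCBD2C0 Handler2=0000FFFF8FB79C8C InaccessibleAddress=00000000D65F03C8
R0=0000000000000000 R1=0000000000002EE7 R2=D65F03C0F9400C00 R3=0000FFFF8F6E2AD0
R16=0000FFFF8FC30328 R17=0000FFFF90B3AE1C R18=0000000000000001 R19=00000000D65F03C0
"""
print(registers_near_fault(sample))  # {'R19': 8}
```

For this dump the only match is R19, whose value is 8 bytes below the faulting address.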
@knn-k knn-k added test failure arch:aarch64 segfault Issues that describe segfaults / JVM crashes labels Mar 26, 2020

knn-k commented Mar 26, 2020

My observations:

  • This is not reproducible on a quad-core AArch64 device. The test server has 96 CPU cores.
  • I reproduced the crash 5 times out of 7 runs on the 96-core server.
  • The crash is always in the JIT IProfiler thread, but the crashing function differs from one crash to the next.
  • If you run the test with IBM_JAVA_OPTIONS=-Xjit:disableIprofilerDataCollection or with IBM_JAVA_OPTIONS=-Xjit:disableInterpreterProfilingThread, it passes. (I tried each option more than 10 times.)
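
A rough way to measure an intermittent failure rate like the one above is to run the same command repeatedly and count non-zero exits, with and without the JIT option set in the environment. This is a minimal sketch under stated assumptions: `count_failures` is a hypothetical helper, and the command shown is a stand-in that always succeeds, not the actual test launcher, which is not given in this issue.

```python
import os
import subprocess
import sys

def count_failures(command, runs, extra_env=None):
    """Run `command` `runs` times and return how many runs exited non-zero.
    A process killed by a signal also counts as a failure."""
    env = dict(os.environ)
    if extra_env:
        env.update(extra_env)
    failures = 0
    for _ in range(runs):
        result = subprocess.run(command, env=env,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            failures += 1
    return failures

# Stand-in command so the sketch runs anywhere; replace it with the real test command.
placeholder = [sys.executable, "-c", "import sys; sys.exit(0)"]
# Environment variable and option taken from the observation above.
jit_option = {"IBM_JAVA_OPTIONS": "-Xjit:disableIprofilerDataCollection"}
print(count_failures(placeholder, runs=3, extra_env=jit_option))
```

Comparing the count with and without `extra_env` gives the kind of "5 out of 7" versus "0 out of 10" comparison described above.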

An example of crash location in the IProfiler thread:

#13 <signal handler called>
#14 0x0000ffffa6fd826c in searchForMethodSample (bucket=6788, omb=0x8a9cb0, this=0xffffa45325d0)
    at /root/openj9-openjdk-jdk11/build/linux-aarch64-normal-server-release/vm/compiler/../compiler/runtime/IProfiler.cpp:1373
#15 TR_IProfiler::findOrCreateMethodEntry (this=this@entry=0xffffa45325d0, callerMethod=callerMethod@entry=0x8a9cd0,
    calleeMethod=0x8a9cb0, addIt=addIt@entry=true, pcIndex=65)
    at /root/openj9-openjdk-jdk11/build/linux-aarch64-normal-server-release/vm/compiler/../compiler/runtime/IProfiler.cpp:1130
#16 0x0000ffffa6fde9e0 in TR_IProfiler::parseBuffer (this=0xffffa45325d0, vmThread=0x46af00,
    dataStart=0xfffca00073f0 "\261\230-H\377\377", size=1019, verboseReparse=<optimized out>)
    at /root/openj9-openjdk-jdk11/build/linux-aarch64-normal-server-release/vm/compiler/../compiler/runtime/IProfiler.cpp:4144
#17 0x0000ffffa6fdece8 in TR_IProfiler::processWorkingQueue (this=this@entry=0xffffa45325d0)
    at /root/openj9-openjdk-jdk11/build/linux-aarch64-normal-server-release/vm/compiler/../compiler/runtime/IProfiler.cpp:3937
#18 0x0000ffffa6fded94 in iprofilerThreadProc (entryarg=<optimized out>)
    at /root/openj9-openjdk-jdk11/build/linux-aarch64-normal-server-release/vm/compiler/../compiler/runtime/IProfiler.cpp:3665

knn-k commented Mar 26, 2020

Is there any other JIT option to try?

knn-k commented Apr 2, 2020

eclipse/omr#5014 will fix this.

knn-k commented Apr 2, 2020

knn-k commented Apr 7, 2020

This is no longer reproducible. Closing.

@knn-k knn-k closed this as completed Apr 7, 2020
@M-Davies

I have seen some recurrences of this on https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk11_j9_sanity.openjdk_aarch64_linux/140/consoleFull (test-aws-rhel76-armv8-1)

02:43:49  stderr:
02:43:49  Unhandled exception
02:43:49  Type=Segmentation error vmState=0x00000000
02:43:49  J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
02:43:49  Handler1=0000FFFF7E66E51C Handler2=0000FFFF7E529C38 InaccessibleAddress=0000000000000008
02:43:49  R0=0000000000000000 R1=0000FFFF599ADB98 R2=0000000000000008 R3=0000FFFF7E265C50
02:43:49  R4=0000000000000000 R5=0000FFFF7E265E30 R6=0000000000000002 R7=0000FFFF5A008E29
02:43:49  R8=0000000000000002 R9=000000000000000A R10=0000000000000004 R11=0000000000000006
02:43:49  R12=000000000000FFB0 R13=00000003E8000000 R14=001EAE23AAD24C41 R15=0000FFFF599AE110
02:43:49  R16=0000FFFF7E240148 R17=0000FFFF7DA04AB8 R18=0000000000000004 R19=0000FFFF780B6EE0
02:43:49  R20=0000FFFF7E23F000 R21=0000FFFF599ADB98 R22=0000FFFF7C7F0E90 R23=0000FFFF599ADB6F
02:43:49  R24=0000FFFF780B3160 R25=00000000FE98BBB0 R26=00000000FE98BBC8 R27=0000000000000050
02:43:49  R28=000000000000001B R29=0000FFFF599ADC10 R30=0000FFFF7D9DD494 R31=0000FFFF599ADAE0
02:43:49  PC=0000FFFF7D9DD494 SP=0000FFFF599ADAE0 PSTATE=0000000060000000
02:43:49  V0 4000000000000000 (f: 0.000000, d: 2.000000e+00)
02:43:49  V1 c1e0000000000000 (f: 0.000000, d: -2.147484e+09)
02:43:49  V2 000000003f800000 (f: 1065353216.000000, d: 5.263544e-315)
02:43:49  V3 000000003e130d44 (f: 1041435968.000000, d: 5.145377e-315)
02:43:49  V4 000000003c1e96e0 (f: 1008637696.000000, d: 4.983332e-315)
02:43:49  V5 00000000390e7854 (f: 957249600.000000, d: 4.729442e-315)
02:43:49  V6 0000000041700000 (f: 1097859072.000000, d: 5.424145e-315)
02:43:49  V7 978b8699c0013676 (f: 3221304832.000000, d: -2.945863e-195)
02:43:49  V8 0000000000000000 (f: 0.000000, d: 0.000000e+00)
02:43:49  V9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
02:43:49  V10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
02:43:49  V11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
02:43:49  V12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
02:43:49  V13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
02:43:49  V14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
02:43:49  V15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
02:43:49  V16 5a8279995a827999 (f: 1518500224.000000, d: 1.000489e+128)
02:43:49  V17 6ed9eba16ed9eba1 (f: 1859775360.000000, d: 9.594426e+225)
02:43:49  V18 8f1bbcdc8f1bbcdc (f: 2400959744.000000, d: -6.815449e-236)
02:43:49  V19 ca62c1d6ca62c1d6 (f: 3395469824.000000, d: -2.193092e+50)
02:43:49  V20 55afa36849bd3f66 (f: 1237139328.000000, d: 5.668939e+104)
02:43:49  V21 61ee486f8a63f84c (f: 2321807360.000000, d: 5.449616e+163)
02:43:49  V22 efcdab8967452301 (f: 1732584192.000000, d: -3.598696e+230)
02:43:49  V23 0000000000000000 (f: 0.000000, d: 0.000000e+00)
02:43:49  V24 0000000000000000 (f: 0.000000, d: 0.000000e+00)
02:43:49  V25 0000000000000000 (f: 0.000000, d: 0.000000e+00)
02:43:49  V26 0000000000000000 (f: 0.000000, d: 0.000000e+00)
02:43:49  V27 0000000000000000 (f: 0.000000, d: 0.000000e+00)
02:43:49  V28 0000000000000000 (f: 0.000000, d: 0.000000e+00)
02:43:49  V29 0000000000000000 (f: 0.000000, d: 0.000000e+00)
02:43:49  V30 000000003f400000 (f: 1061158912.000000, d: 5.242822e-315)
02:43:49  V31 000000003f400000 (f: 1061158912.000000, d: 5.242822e-315)
02:43:49  Module=/home/jenkins/workspace/Test_openjdk11_j9_sanity.openjdk_aarch64_linux/openjdkbinary/j2sdk-image/lib/compressedrefs/libj9jit29.so
02:43:49  Module_base_address=0000FFFF7D8D0000
02:43:49  Target=2_90_20200512_313 (Linux 4.14.0-115.2.2.el7a.aarch64)
02:43:49  CPU=aarch64 (4 logical CPUs) (0x1c89d0000 RAM)
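
To relate a dump like this to the JIT library, the module-relative offset of the crash can be computed as PC minus Module_base_address. The following is a minimal sketch; `pc_offset_in_module` is a hypothetical helper name, and the sample is a subset of the lines above.

```python
import re

def pc_offset_in_module(dump_text):
    """Return PC - Module_base_address from a J9 crash dump, or None
    if either field is missing."""
    pc = re.search(r"\bPC=([0-9A-Fa-f]+)", dump_text)
    base = re.search(r"Module_base_address=([0-9A-Fa-f]+)", dump_text)
    if pc is None or base is None:
        return None
    return int(pc.group(1), 16) - int(base.group(1), 16)

# Subset of the dump above.
sample = """
PC=0000FFFF7D9DD494 SP=0000FFFF599ADAE0 PSTATE=0000000060000000
Module_base_address=0000FFFF7D8D0000
"""
print(hex(pc_offset_in_module(sample)))  # 0x10d494
```

Assuming the offset maps directly onto the library's addresses, it could then be passed to a symbolizer together with the matching libj9jit29.so build.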

Dumps at: https://ibm.box.com/shared/static/o1o2k3hybtncfi9ld7kr7e6t9h1j37kz.gz

I kicked off some Grinder runs on the same machine for each of the tests that threw the error, to see whether I can reproduce it consistently.

NOTE: Also seen again on https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk11_j9_sanity.openjdk_aarch64_linux_xl/107/consoleFull Dumps for that are here: https://ibm.box.com/shared/static/3wxlwawf8jm0hqefeqbaxerw8qzv1lzg.gz

M-Davies commented May 12, 2020

https://ci.adoptopenjdk.net/job/Grinder/3021/ saw some failures similar to the ones above.

Running a Grinder with a single iteration of tools/pack200/Pack200Test.java reproduces the failure: https://ci.adoptopenjdk.net/job/Grinder/3026/.

It seems to occur across AArch64 machines: https://ci.adoptopenjdk.net/job/Grinder/3027/. I will look into the failing cases for https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk11_j9_sanity.openjdk_aarch64_linux_xl/107/consoleFull to see whether tools/pack200/Pack200Test.java is the only problem here.

@pshipton pshipton reopened this May 12, 2020
@M-Davies

Looks like it's just Pack200Test.java. The other failing tests in https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk11_j9_sanity.openjdk_aarch64_linux_xl/107/consoleFull don't show the error anymore:

java/lang/String/SpecialCasingTest.java.SpecialCasingTest
java/util/Calendar/GregorianCutoverTest.java.GregorianCutoverTest
java/util/concurrent/BlockingQueue/PollMemoryLeak.java.PollMemoryLeak
java/util/zip/ZipFile/TestCleaner.java.TestCleaner

@M-Davies

Also seen for tests

java/util/concurrent/forkjoin/Integrate.java.Integrate
java/util/concurrent/forkjoin/NQueensCS.java.NQueensCS
java/util/logging/TestConfigurationListeners.java.TestConfigurationListeners

on machine test-packet-armv8-ubuntu-16-04
Ref https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk11_j9_sanity.openjdk_aarch64_linux/141

knn-k commented May 15, 2020

All the crashes in https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk11_j9_sanity.openjdk_aarch64_linux/142 occurred at the following location in TR::DefaultCompilationStrategy::processEvent() because TR::Recompilation::getJittedBodyInfoFromPC() returns NULL.
https://github.com/eclipse/openj9/blob/a8d0a6e4d119b0e571d2da1ef030649f2e824413/runtime/compiler/control/CompilationController.cpp#L203-L206

I am working on enabling JIT recompilation right now, and I think that will fix these crashes.

knn-k commented May 15, 2020

I opened #9574 to track the crashes involving getJittedBodyInfoFromPC(), because they are independent of the SharedClasses.SCM01.MultiThreadMultiCL_0 crash in this issue.

@knn-k knn-k closed this as completed May 15, 2020