Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jdk15+36: java/foreign/TestMismatch.java.TestMismatch J9 Crash #10588

Closed
andrew-m-leonard opened this issue Sep 14, 2020 · 79 comments
Closed

jdk15+36: java/foreign/TestMismatch.java.TestMismatch J9 Crash #10588

andrew-m-leonard opened this issue Sep 14, 2020 · 79 comments
Labels
arch:power comp:jit prio:high segfault Issues that describe segfaults / JVM crashes test failure

Comments

@andrew-m-leonard
Copy link
Contributor

andrew-m-leonard commented Sep 14, 2020

Failure link

https://ci.adoptopenjdk.net/job/Test_openjdk15_j9_sanity.openjdk_ppc64le_linux/43/consoleFull
java/foreign/TestMismatch.java
Fails on platforms: Platforms: pLinux, xLinuxXL

Optional info

  • intermittent failure (yes|no): possibly
  • regression or new test: Regression
  • if regression, what are the last passing / first failing public SHAs (OpenJ9, OMR, JCL) :

Failure output (captured from console output)

14:13:28  testDifferentValues s1:MemorySegment{ id=0x6c3ff368 limit: 1 }, s2:MemorySegment{ id=0x266c2bbe limit: 1 }
14:13:28  test TestMismatch.testDifferentValues(jdk.internal.foreign.HeapMemorySegmentImpl@6bd39stderr:
14:13:28  Unhandled exception
14:13:28  Type=Segmentation error vmState=0x00000000
14:13:28  J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
14:13:28  Handler1=0000794DBD6EA980 Handler2=0000794DBD5479A0
14:13:28  R0=0000000080000009 R1=0000794D9F01AC50 R2=0000000000000000 R3=0000794D36000000
14:13:28  R4=FFFFFFFFFC000000 R5=0000000000000000 R6=0000794DBE4701C8 R7=0000000000000000
14:13:28  R8=0000000000000001 R9=0000000000000020 R10=0000000000000000 R11=FFFFFFFFFFFFF001
14:13:28  R12=0000000000000000 R13=0000794D9F0268E0 R14=0000000000324DC0 R15=000000000031F900
14:13:28  R16=0000794D9D020038 R17=FFFFFFFFFFFFFFFF R18=0000794D9F01AE70 R19=0000000000000000
14:13:28  R20=0000000000324F90 R21=00000000002C0000 R22=000000000031F900 R23=0000000000000001
14:13:28  R24=0000000080000009 R25=0000000000000000 R26=0000794CB5FF0040 R27=0000000080000009
14:13:28  R28=0000000000055160 R29=00000000002C0000 R30=0000000080000009 R31=0000000000000001
14:13:28  NIP=0000794D9F771FD4 MSR=800000000280D033 ORIG_GPR3=C000000000337DB4 CTR=FFFFFFFFF7FFF802
14:13:28  LINK=0000794D9F771F84 XER=0000000020000000 CCR=0000000048004282 SOFTE=0000000000000001
14:13:28  TRAP=0000000000000300 DAR=0000794D36000000 dsisr=0000000042000000 RESULT=0000000000000000
14:13:28  FPR0 0000794d3d8b09b9 (f: 1032522176.000000, d: 6.589485e-310)
14:13:28  FPR1 40452f4e40000000 (f: 1073741824.000000, d: 4.236958e+01)
14:13:28  FPR2 4059000000000000 (f: 0.000000, d: 1.000000e+02)
14:13:28  FPR3 3fee666660000000 (f: 1610612736.000000, d: 9.500000e-01)
14:13:28  FPR4 3fce840b4ac4e4d2 (f: 1254417664.000000, d: 2.384047e-01)
14:13:28  FPR5 bfe7154748bef6c8 (f: 1220474624.000000, d: -7.213475e-01)
14:13:28  FPR6 3fe62e42fefa39ef (f: 4277811712.000000, d: 6.931472e-01)
14:13:28  FPR7 3fbc5e53aa362eb4 (f: 2855677696.000000, d: 1.108143e-01)
14:13:28  FPR8 006f0076006e0049 (f: 7209033.000000, d: 1.379626e-306)
14:13:28  FPR9 002000720065006b (f: 6619243.000000, d: 4.450632e-308)
14:13:28  FPR10 3809c01c80000000 (f: 2147483648.000000, d: 9.459216e-39)
14:13:28  FPR11 0000000000000040 (f: 64.000000, d: 3.162020e-322)
14:13:28  FPR12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:13:28  FPR13 0000000000000001 (f: 1.000000, d: 4.940656e-324)
14:13:28  FPR14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:13:28  FPR15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:13:28  FPR16 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:13:28  FPR17 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:13:28  FPR18 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:13:28  FPR19 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:13:28  FPR20 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:13:28  FPR21 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:13:28  FPR22 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:13:28  FPR23 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:13:28  FPR24 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:13:28  FPR25 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:13:28  FPR26 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:13:28  FPR27 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:13:28  FPR28 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:13:28  FPR29 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:13:28  FPR30 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:13:28  FPR31 0000000000000000 (f: 0.000000, d: 0.000000e+00)
14:13:28  
14:13:28  Compiled_method=jdk/internal/foreign/NativeMemorySegmentImpl.makeNativeSegment(JJ)Ljdk/incubator/foreign/MemorySegment;
14:13:28  Target=2_90_20200911_34 (Linux 4.15.0-45-generic)
14:13:28  CPU=ppc64le (4 logical CPUs) (0x1fe0c0000 RAM)
14:13:28  ----------- Stack Backtrace -----------
14:13:28  (0x0000794D9F771FD4 [<unknown>+0x0])
14:13:28  (0x0000794DBD6D23AC [libj9vm29.so+0x923ac])
14:13:28  (0x0000794DBD6D36D8 [libj9vm29.so+0x936d8])
14:13:28  (0x0000794DBC4E55C4 [libjclse29.so+0x755c4])
14:13:28  JVM_InvokeMethod+0x40 (0x0000794DBD872970 [libjvm.so+0x22970])
14:13:28  JVM_InvokeMethod+0x34 (0x0000794DBE216884 [libjvm.so+0x6884])
14:13:28  Java_jdk_internal_reflect_NativeMethodAccessorImpl_invoke0+0x24 (0x0000794D9CC9BFA4 [libjava.so+0xbfa4])
14:13:28  (0x0000794D9F4F59A4 [<unknown>+0x0])
14:13:28  (0x0000794DBD6D23AC [libj9vm29.so+0x923ac])
14:13:28  (0x0000794DBD6D36D8 [libj9vm29.so+0x936d8])
14:13:28  (0x0000794DBC4E55C4 [libjclse29.so+0x755c4])
14:13:28  JVM_InvokeMethod+0x40 (0x0000794DBD872970 [libjvm.so+0x22970])
14:13:28  JVM_InvokeMethod+0x34 (0x0000794DBE216884 [libjvm.so+0x6884])
14:13:28  Java_jdk_internal_reflect_NativeMethodAccessorImpl_invoke0+0x24 (0x0000794D9CC9BFA4 [libjava.so+0xbfa4])
14:13:28  (0x0000794D9F4F59A4 [<unknown>+0x0])
14:13:28  (0x0000794DBD6CD6B4 [libj9vm29.so+0x8d6b4])
14:13:28  (0x0000794DBD740C30 [libj9vm29.so+0x100c30])
14:13:28  (0x0000794DBD548DA8 [libj9prt29.so+0x28da8])
14:13:28  (0x0000794DBD73C264 [libj9vm29.so+0xfc264])
14:13:28  (0x0000794DBD611248 [libj9thr29.so+0x11248])
14:13:28  (0x0000794DBE4B885C [libpthread.so.0+0x885c])
14:13:28  clone+0x98 (0x0000794DBE3A9028 [libc.so.6+0x159028])
14:13:28  ---------------------------------------

@JasonFengJ9
Copy link
Member

It appears #10580 with different error output.

@DanHeidinga
Copy link
Member

Compiled_method=jdk/internal/foreign/NativeMemorySegmentImpl.makeNativeSegment(JJ)Ljdk/incubator/foreign/MemorySegment;

fyi @andrewcraik

@DanHeidinga DanHeidinga added arch:power arch:x86 segfault Issues that describe segfaults / JVM crashes labels Sep 14, 2020
@DanHeidinga DanHeidinga added this to the Release 0.22 (Java 15) milestone Sep 14, 2020
@DanHeidinga
Copy link
Member

We need a stop-ship determination on this issue. Given the crash is in java.foreign package which I believe is still in preview, I don't think will be but we should still look at the crash to be sure it's a broader issue

@andrewcraik
Copy link
Contributor

@liqunl sorry to interrupt your dev work but could you take a look as this is lined up against the release that will be going out imminently.

@liqunl
Copy link
Contributor

liqunl commented Sep 14, 2020

@andrew-m-leonard @llxia What should be the target if I want to rerun the failed test?

@liqunl
Copy link
Contributor

liqunl commented Sep 14, 2020

@andrew-m-leonard You mentioned the test also failed on xlinux, could you provide the link to the x86 failure?

@liqunl
Copy link
Contributor

liqunl commented Sep 14, 2020

#10580 is a failure in the same test and it has a link to rerun the test with target jdk_foreign_0.
20 iterations
https://ci.adoptopenjdk.net/job/Grinder/3856/

But it failed. @rpshukla Could you help with the grinder? I don't know why my grinder failed, maybe something's wrong in my configuration?

@andrew-m-leonard
Copy link
Contributor Author

@andrew-m-leonard
Copy link
Contributor Author

@liqunl i've just kicked off a Grinder on xLinux for just the TestMissmatch test here: https://ci.adoptopenjdk.net/job/Grinder/3859/console

@andrew-m-leonard
Copy link
Contributor Author

@liqunl fyi, the "Rerun Grinder" links rarely work as-is, as they typically point at the "upstream" job artifact, which by the time you run the grinder has dissappeared!
I change it to use "customized" and point SDK_URL at the required github binary

@mayshukla
Copy link
Contributor

I started another plinux grinder using CUSTOMIZED_SDK_URL instead of UPSTREAM_JOB_NAME
https://ci.adoptopenjdk.net/job/Grinder/3861/

@liqunl
Copy link
Contributor

liqunl commented Sep 14, 2020

Looked at the xlinuxXL failure, the error message is the same as #10580, which is

TEST RESULT: Error. Agent error: java.lang.Exception: Agent 2 timed out with a timeout of 960 seconds; check console log for any additional details

So the crash is only seen on ppcle. FYI @gita-omr

@gita-omr
Copy link
Contributor

@AlenBadel could you kindly take a look?

@AlenBadel
Copy link
Contributor

AlenBadel commented Sep 15, 2020

Failure rate on my side is between 1/10 and 1/20.

Given the crash is in java.foreign package which I believe is still in preview, I don't think will be but we should still look at the crash to be sure it's a broader issue

java.foreign was bundled with OpenJDK and Oracle builds starting with Java 14. I understand J9 may not be in total sync with package support (i.e javafx). Is foreign bundled with J9 Java 14 builds?

@pshipton
Copy link
Member

Foreign is still incubator status for Java 15, but it is included.
https://openjdk.java.net/jeps/383

@AlenBadel
Copy link
Contributor

It looks like the foreign package is also included in J9 Java 14 builds. I ran it with the latest Java 14 release, and was not able to reproduce the failure. This suggests that this is a recent regression.
I'm still actively attempting to open the core file, and analyzing the failure.

@AlenBadel
Copy link
Contributor

I'm afraid that the core files are not readable on my end. I've even tried to add -Xrs in order to get more core context and that causes EOF issues with the test, none of the iterations have crashed using -Xrs.

Continuing my investigation by attempting to increase the reproducibility, so that I can reliably get this test to fail locally.

@pshipton
Copy link
Member

I'm guessing the core file(s) aren't readable because you need the jextract data from the machine that produced the core file. If it's reproducible on an OpenJ9 or internal jenkins machine we can get that. If it's only reproducible at Adopt, someone with access to those machines could run jextract.

@DanHeidinga
Copy link
Member

This is considered stop-ship until we have a core file and can evaluate the cause and frequency

@liqunl
Copy link
Contributor

liqunl commented Sep 16, 2020

It's reproducible on internal jenkins machine, failure rate seems to be 4/12 on nightly build. Also saw crash in compiled method jdk/internal/misc/Unsafe.setMemory(JJB)V
The following grinder is to reproduce the timeout, and it hasn't finished yet.
https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/10355

@AlenBadel
Copy link
Contributor

AlenBadel commented Sep 16, 2020

I was able to reproduce this locally on one of our internal farm machines. Failing at a rate of 100% on this particular machine.

jextract can't find the libraries unfortunately. GDB, as well as our internal debugging tools are not able to load the libraries even when given the prefix, and absolute path.

/workspace/ws/openjdk-tests$ ../jdk-15+36/bin/jextract ./TKG/test_output_16002861801429/jdk_foreign_0/work/java/foreign/TestSegments/core.20200916.155655.9347.0001.dmp
Loading dump file...
Read memory image from ./TKG/test_output_16002861801429/jdk_foreign_0/work/java/foreign/TestSegments/core.20200916.155655.9347.0001.dmp
Creating archive file: ./TKG/test_output_16002861801429/jdk_foreign_0/work/java/foreign/TestSegments/core.20200916.155655.9347.0001.dmp.zip
Adding "./TKG/test_output_16002861801429/jdk_foreign_0/work/java/foreign/TestSegments/core.20200916.155655.9347.0001.dmp"
Warning:  Could not find file "libgcc_s.so.1" for inclusion in archive
Warning:  Could not find file "libnet.so" for inclusion in archive
Warning:  Could not find file "libjava.so" for inclusion in archive
Warning:  Could not find file "ld64.so.2" for inclusion in archive
Warning:  Could not find file "libzip.so" for inclusion in archive
Warning:  Could not find file "librt.so.1" for inclusion in archive
Warning:  Could not find file "libm.so.6" for inclusion in archive
Warning:  Could not find file "libz.so.1" for inclusion in archive
Warning:  Could not find file "libjclse29.so" for inclusion in archive
Warning:  Could not find file "libjncrypto.so" for inclusion in archive
Warning:  Could not find file "libj9zlib29.so" for inclusion in archive
Warning:  Could not find file "libdl.so.2" for inclusion in archive
Warning:  Could not find file "libpthread.so.0" for inclusion in archive
Warning:  Could not find file "libnss_nis.so.2" for inclusion in archive
Warning:  Could not find file "libj9dmp29.so" for inclusion in archive
Warning:  Could not find file "libj9trc29.so" for inclusion in archive
Warning:  Could not find file "libc.so.6" for inclusion in archive
Warning:  Could not find file "libjimage.so" for inclusion in archive
Warning:  Could not find file "libjvm.so" for inclusion in archive
Warning:  Could not find file "linux-vdso64.so.1" for inclusion in archive
Warning:  Could not find file "libj9vm29.so" for inclusion in archive
Warning:  Could not find file "libjli.so" for inclusion in archive
Warning:  Could not find file "libj9prt29.so" for inclusion in archive
Warning:  Could not find file "libnss_compat.so.2" for inclusion in archive
Adding "/home/jenkins/workspace/ws/jdk-15+36/bin/java"
Warning:  Could not find file "libomrsig.so" for inclusion in archive
Warning:  Could not find file "libj9hookable29.so" for inclusion in archive
Warning:  Could not find file "libj9vrb29.so" for inclusion in archive
Warning:  Could not find file "libnsl.so.1" for inclusion in archive
Warning:  Could not find file "libj9thr29.so" for inclusion in archive
Warning:  Could not find file "libj9gc29.so" for inclusion in archive
Warning:  Could not find file "libj9shr29.so" for inclusion in archive
Warning:  Could not find file "libnio.so" for inclusion in archive
Warning:  Could not find file "libnss_files.so.2" for inclusion in archive
Warning:  Could not find file "libj9jvmti29.so" for inclusion in archive
Warning:  Could not find file "libextnet.so" for inclusion in archive
Adding "/home/jenkins/workspace/ws/jdk-15+36/lib/J9TraceFormat.dat"
Adding "/home/jenkins/workspace/ws/jdk-15+36/lib/OMRTraceFormat.dat"
jextract complete.

I've also tried to generate a trace file, without any luck.

I'll attempt to run this workload with gdb to get a snapshot of where we're crashing.

@AlenBadel
Copy link
Contributor

AlenBadel commented Sep 25, 2020

The size of the segment inside the java stack is correct. This issue is stemming from an instruction call srawi which only shifts the lower 32 bits. I.e it sounds like somewhere we're assuming the generated size to be an integer.

ld      gr0, [gr14, 176]                # SymRef  alignedSize<auto slot 
...
srawi   gr4, gr0, 5
mtctr   gr4

@AlenBadel
Copy link
Contributor

Changing srawi to sradi looks to resolve the issue on Power.

AlenBadel/omr@0b9653c

I'll be taking a look on xLinuxXL to see if it's a similar story.

@gita-omr
Copy link
Contributor

it sounds like somewhere we're assuming the generated size to be an integer.

I think we need to find out which opcode (or some recognized method?) this instruction is coming from and check/document overall assumptions about the incoming length for that code.

@IBMJimmyk
Copy link
Contributor

From what Alen showed me, it comes from Unsafe.setMemory(Ljava/lang/Object;JJB)V. The third parameter is the size of the data to write and it accepts a Long value. The generated code treating it like an int value is wrong.

@AlenBadel
Copy link
Contributor

Running the same test #10588 (comment) on XLinuxXL does not produce a crash, or does it hang.

I will continue attempting to reproduce this on XLinuxXL, but I believe the reproduction rate was established to be less than 1 in 300. #10580 (comment)

@andrewcraik
Copy link
Contributor

@IBMJimmyk is a tree problem or a codegen problem just to be clear on the width issue at play here? I'm guessing codegen?

@IBMJimmyk
Copy link
Contributor

At least the problem on Power was a codegen issue. The child for the size of the data to write was an lload which is expected. What was going on was the call to Unsafe.setMemory was being recognized as sun_misc_Unsafe_setMemory. Part way into generating code for a recognized call to Unsafe.setMemory the wrong data size is used from an instruction which treated the data as if it was int and not long. That's what caused the bug and is in the fix Alen made a few comments earlier.

@andrewcraik
Copy link
Contributor

ok great thanks - just confirming there wasn't a trees/common issue - looks to be codegen - thanks!

@AlenBadel
Copy link
Contributor

In regards to the crash on power, the following are the trees generated.

 n489n    (  0)  treetop                                                                              [    0x7fffcea98890] bci=[6,6,1361] rc=0 vc=19 vn=- li=45 udi=- nc=1
 n488n    (  0)    call  jdk/internal/misc/Unsafe.setMemory(Ljava/lang/Object;JJB)V[#487  final virtual Method -720] [flags 0x20500 0x0 ] ()  [    0x7fffcea98840] bci=[6,6,1361] rc=0 vc=19 vn=- li=45 udi=- nc=5 flg=0x20
 n483n    (  0)      aload  <temp slot 6>[#477  Auto] [flags 0x20000007 0x0 ] (X!=0 sharedMemory )    [    0x7fffcea986b0] bci=[6,0,1361] rc=0 vc=19 vn=- li=45 udi=- nc=0 flg=0x4
 n484n    (  0)      aconst NULL (in GPR_0641)                                                        [    0x7fffcea98700] bci=[6,1,1361] rc=0 vc=19 vn=- li=45 udi=36048 nc=0
 n485n    (  0)      lload  buf<auto slot 6>[#392  Auto] [flags 0x4 0x0 ] (in GPR_0642)               [    0x7fffcea98750] bci=[6,2,1361] rc=0 vc=19 vn=- li=45 udi=36320 nc=0
 n486n    (  0)      lload  alignedSize<auto slot 4>[#386  Auto] [flags 0x4 0x0 ] (in GPR_0644)       [    0x7fffcea987a0] bci=[6,3,1361] rc=0 vc=19 vn=- li=45 udi=36976 nc=0
 n495n    (  0)      iconst 0 (in GPR_0646)                                                           [    0x7fffcea98a70] bci=[-1,41,93] rc=0 vc=19 vn=- li=45 udi=37632 nc=0

I'll be taking a second look at the generated sequence to see if there were any similar assumptions, otherwise we should be ready for a PR.

On the XlinuxXL front, I'm checking if there are similar assumptions taken (I.e size of data treated as an int), otherwise there's very little I can do without being able to reproduce the failure.

@andrewcraik
Copy link
Contributor

so the trees do look correct so the issue would just be incorrect widths in the codegen - thanks for the great analysis @AlenBadel

@gita-omr
Copy link
Contributor

I think we still need to understand the whole flow: which opcodes and evaluators are involved. For example, we might be reusing some code that normally accepts int length.

@AlenBadel
Copy link
Contributor

I think we still need to understand the whole flow: which opcodes and evaluators are involved. For example, we might be reusing some code that normally accepts int length.

I agree. I will have an update soon with the summary of the issue from a top-down approach.

@keithc-ca
Copy link
Contributor

It occurs to me that this would have been avoided if we had some kind of 'compiler' to check uses like that, along the lines of the error you get in Java passing a long to a function that expects an int (e.g. Integer.valueOf(42L)). To make it concrete here, we would have a class whose job is to form an instruction stream, that class would have methods srawi(Register32,Register32,int32_t) and sradi(Register64,Register64,int64_t) - calling the former with 64-bit registers or a a 64-bit value would not compile.

@gita-omr
Copy link
Contributor

gita-omr commented Sep 28, 2020

Yeah, it's something to consider in the future. Basically, Power codegen's assumption is that a register containing an Int can have an undefined value in its upper half. So Int and Long types need to be treated differently. However we only have Register class and not Register32 and Register64.

@AlenBadel
Copy link
Contributor

AlenBadel commented Sep 30, 2020

To Summarize the issue on Power:

We’ve established that the crash occurs when MemorySegment.allocateNative is JIT compiled and invoked with a requested memory segment size greater than Integer.MAX_VALUE.

E.x MemorySegment.allocateNative((long)Integer.MAX_VALUE + 10L)

The method attempts to create, and allocate a new memory segment that is an allocated block from off-heap memory. The size argument represents the size of the memory segment requested.

The crash
The segmentation fault occurs inside the compiled inlined call to Unsafe.SetMemory which is called by MemorySegment.allocateNative. Given the reference to the memory segment, offset, size and a value, the method will set the entire memory segment to a value specified by the value argument. In this case, the method is being called to clear memory.

A segmentation fault occurs because the system was attempting to clear read-only memory[1]. This is happening because the instructions generated which computes how much memory should be cleared was not correct[2].

Where was this code generated? 


Taking a look at the trees for the sequence of generated instructions[3], we can see that the compiled method is calling unsafe_setmemory, which is a recognized method[4]. 

Specifically on Power, Unsafe.SetMemory is compiled using the setmemoryEvaluator[5], which is only invoked by inlineDirectCall. The InlineDirectCall is invoked when compiling an inlined call(TR::call), this is illustrated looking in the prior treetop reference. This also explains why our test case would not fail when disabling inlining, as that would prevent inlining of the direct call which ultimately generates the sequence.

The only point where setmemoryEvaluator is called, is when compiling Unsafe.SetMemory[6]. The signature for Unsafe.SetMemory suggests four arguments (Object, long, long, bytes). Hence the size of the segment is always specified as a long. Further it is a valid and concrete requirement that the size of the segment be a long, and the evaluator is not used for any other case other than compiling Unsafe.SetMemory.

The Fix

As discussed within the Unsafe.SetMemory compiled method, this issue stems from an instruction call srawi. The intention of this instruction is to properly set the loop counter so that the entire data region is properly encoded[1].
This instruction shifts the lower 32 bits, which limits the size of the memory region to a size contained by an int(32-bits), rather than the long(64-bits) we expect. Instead, this should be replaced by sradi which shifts a double word, and supports a segment size of long[7].

[1] #10588 (comment)
[2] #10588 (comment)
[3] #10588 (comment)
[4] https://github.com/eclipse/openj9/blob/9b680196cdee54cd185340e287e13ad924b9052a/runtime/compiler/env/j9method.cpp#L3362
[5] https://github.com/eclipse/omr/blob/0b9653c23ba22247f66c11ad6df8a3b46c2c4738/compiler/p/codegen/OMRTreeEvaluator.cpp#L4159-L4160
[6] https://github.com/eclipse/openj9/blob/9369e4e005a130411ae2606a25d08588a0920c7a/runtime/compiler/p/codegen/J9TreeEvaluator.cpp#L12052
[7] #10588 (comment)

@gita-omr
Copy link
Contributor

gita-omr commented Sep 30, 2020

Thanks for the very detailed investigation and the summary. However, I am not sure I agree with the proposed fix.

setmemoryEvaluator is defined in OMR without much description of the expected input parameters [1]. It is invoked in openj9 like this:

https://github.com/eclipse/openj9/blob/9369e4e005a130411ae2606a25d08588a0920c7a/runtime/compiler/p/codegen/J9TreeEvaluator.cpp#L12049-L12052

Note that the length information is conveyed via creating an arrayset node that normally takes Int length.

I am not sure we can change OMR method to adjust it to a particular consumer, openj9, especially since even this consumer passes information in somewhat inconsistent way.

[1] https://github.com/eclipse/omr/blob/0b9653c23ba22247f66c11ad6df8a3b46c2c4738/compiler/p/codegen/OMRTreeEvaluator.cpp#L4159-L4160

@gita-omr
Copy link
Contributor

BTW : I think we need to remove the Intel label and assume that this issue will only take care of the Power failure.

@pshipton pshipton removed the arch:x86 label Sep 30, 2020
@pshipton
Copy link
Member

pshipton commented Sep 30, 2020

Removed the intel label, this platform should be covered by #10580

@AlenBadel
Copy link
Contributor

setmemoryEvaluator is defined in OMR without much description of the expected input parameters [1]

This description can certainly be added.

Note that the length information is conveyed via creating an arrayset node that normally takes Int length.

This is true, an ArraySet does normally allocate an array length of type int due to indexing limitations. The limitation is historical, and based off the class ArraySet inherits; ArrayList which itself is an implementation of List.

Looking through the remainder of the Unsafe methods, we can see that we tend use the underlying array evaluators. From what I've seen they all support a length which could be long, or int.

For example, such case as Unsafe.CopyMemory[1] which uses the underlying arrayCopyEvaluator[2]. We see that the underlying arrayCopyEvaluator does support sizes of long[3].

I believe as long as the evaluators we use support the initialization of a contiguous memory segment which has a length argument of type long then we could easily justify it's use to compile and evaluate Unsafe methods.
To note, arrays do not have a bound on how large they can be, rather they have a bound on the number of objects (ArrayList based implementations i.e ArraySet) or primitives they index (Integer.MAX_VALUE). Hence the process of initialization of contiguous memory is identical in the case of Unsafe.CopyMemory, or Unsafe.SetMemory compared to arrays themselves.

I am not sure we can change OMR method to adjust it to a particular consumer, openj9, especially since even this consumer passes information in somewhat inconsistent way.

The evaluator should support both long, or int length values. This is consistent with many of our other evaluators. Java is not the only language that has the capability to allocate, and initialize a very-large contiguous region of memory.

[1] https://github.com/eclipse/omr/blob/0b9653c23ba22247f66c11ad6df8a3b46c2c4738/compiler/p/codegen/OMRTreeEvaluator.cpp#L4276-L4277
[2] https://github.com/eclipse/openj9/blob/9369e4e005a130411ae2606a25d08588a0920c7a/runtime/compiler/p/codegen/J9TreeEvaluator.cpp#L12842
[3] https://github.com/eclipse/openj9/blob/9369e4e005a130411ae2606a25d08588a0920c7a/runtime/compiler/p/codegen/J9TreeEvaluator.cpp#L12890-L12891

@gita-omr
Copy link
Contributor

The evaluator should support both long, or int length values.

Do you mean we should be able to pass nodes with the length child being both Int and Long type?

@AlenBadel
Copy link
Contributor

AlenBadel commented Sep 30, 2020

Do you mean we should be able to pass nodes with the length child being both Int and Long type?

They can be either Long, or Int. We currently don't call setMemoryEvaluator for any other case other than Unsafe.SetMemory() which accepts a long type length value.

However, it's better illustrated with the example of arrayCopyEvaluator.
Which is used by Unsafe.CopyMemory as it accepts a length value of long type, and at the same time is used with System.arraycopy which accepts a length of int type.

@keithc-ca
Copy link
Contributor

I had a look at how writeCrashDataToConsole() might fail, but didn't see anything obvious. Given that the current theory is that the generated code was clearing memory that should have been left alone, it's reasonable that some of the data examined by writeCrashDataToConsole() was corrupted and so a second fault should not be unexpected.

@gita-omr
Copy link
Contributor

gita-omr commented Sep 30, 2020

However, it's better illustrated with the example of arrayCopyEvaluator.
Which is used by Unsafe.CopyMemory as it accepts a length value of long type, and at the same time is used with System.arraycopy which accepts a length of int type.

Exactly. We can use if lengthNode->getType().isInt32() to generate either srawi or sradi. That would be the safest way to address the issue.

@AlenBadel
Copy link
Contributor

OMR PR: eclipse/omr#5590

@AlenBadel
Copy link
Contributor

The above PR has been merged.

Fix will be merged into R0.23 as well.
eclipse-openj9/openj9-omr#81

@gita-omr
Copy link
Contributor

gita-omr commented Oct 6, 2020

@andrew-m-leonard can this be closed?

@pshipton
Copy link
Member

pshipton commented Oct 6, 2020

I'll go ahead and close it since the fix has been merged, we can open a new issue if another problem is found.

@pshipton pshipton closed this as completed Oct 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch:power comp:jit prio:high segfault Issues that describe segfaults / JVM crashes test failure
Projects
None yet
Development

No branches or pull requests