[FFI/Jtreg_JDK21] IllegalArgumentException detected in TestNested.java on AIX #18287

Closed
ChengJin01 opened this issue Oct 16, 2023 · 61 comments
Labels
jdk21 project:panama Used to track Project Panama related work test failure

Comments

@ChengJin01
Contributor

ChengJin01 commented Oct 16, 2023

With my changes (intended for union in downcall & upcall) at ChengJin01@5c3f1e4#diff-710539390ec016913e8ff24f64faf40f7e64cc9de4b1f113e9f0c9820b22908e, two of the subtests in testNested at https://github.com/ibmruntimes/openj9-openjdk-jdk21/blob/openj9/test/jdk/java/foreign/nested/TestNested.java failed on AIX with the following exceptions:

test TestNested.testNested(jdk.internal.foreign.layout.StructLayoutImpl@8f8a4a78): failure
java.lang.IllegalArgumentException: Member layout '[2:D8](f1)', of 
'[B1(f0)x7
 [2:D8](f1)
 B1(f2)x7
 [[3:D8](f0)[S2(f0)]
 (f1)x6
  A8(f2):[*:B1]
  A8(f3):[*:B1]]
 (f3)] (S9)' found at unexpected offset: 8 != 4
	at java.base/jdk.internal.foreign.abi.AbstractLinker.checkMemberOffset(AbstractLinker.java:236)
	at java.base/jdk.internal.foreign.abi.AbstractLinker.checkLayoutRecursive(AbstractLinker.java:196)
	at java.base/jdk.internal.foreign.abi.AbstractLinker.checkLayout(AbstractLinker.java:182)
	at java.base/java.util.Optional.ifPresent(Optional.java:178)
	at java.base/jdk.internal.foreign.abi.AbstractLinker.checkLayouts(AbstractLinker.java:173)
	at java.base/jdk.internal.foreign.abi.AbstractLinker.downcallHandle0(AbstractLinker.java:101)
	at java.base/jdk.internal.foreign.abi.AbstractLinker.downcallHandle(AbstractLinker.java:88)
	at NativeTestHelper.downcallHandle(NativeTestHelper.java:163)
	at TestNested.testNested(TestNested.java:66)

test TestNested.testNested(jdk.internal.foreign.layout.UnionLayoutImpl@6226c73e): failure
java.lang.IllegalArgumentException: Member layout '[2:D8](f1)', 
of '[B1(f0)x7 [2:D8](f1) 
B1(f2)x7[[3:D8](f0)[S2(f0)](f1)x6A8(f2)
:[*:B1]A8(f3):[*:B1]](f3)](f2)' found at unexpected offset: 8 != 4
	at java.base/jdk.internal.foreign.abi.AbstractLinker.checkMemberOffset(AbstractLinker.java:236)
	at java.base/jdk.internal.foreign.abi.AbstractLinker.checkLayoutRecursive(AbstractLinker.java:196)
	at java.base/jdk.internal.foreign.abi.AbstractLinker.checkLayoutRecursive(AbstractLinker.java:209)
	at java.base/jdk.internal.foreign.abi.AbstractLinker.checkLayout(AbstractLinker.java:182)
	at java.base/java.util.Optional.ifPresent(Optional.java:178)
	at java.base/jdk.internal.foreign.abi.AbstractLinker.checkLayouts(AbstractLinker.java:173)
	at java.base/jdk.internal.foreign.abi.AbstractLinker.downcallHandle0(AbstractLinker.java:101)
	at java.base/jdk.internal.foreign.abi.AbstractLinker.downcallHandle(AbstractLinker.java:88)
	at NativeTestHelper.downcallHandle(NativeTestHelper.java:163)
	at TestNested.testNested(TestNested.java:66)

FYI: @tajila, @pshipton

@ChengJin01 ChengJin01 added project:panama Used to track Project Panama related work test failure jdk21 labels Oct 16, 2023
@ChengJin01
Contributor Author

The failing test cases are based on the following layouts:
https://github.com/ibmruntimes/openj9-openjdk-jdk21/blob/9e4783d1f1520d34f5261f8e94ec5d22004ade08/test/jdk/java/foreign/nested/libNested.c#L46

static final StructLayout S9 = MemoryLayout.structLayout(
            C_CHAR.withName("f0"),
            MemoryLayout.paddingLayout(7), <-----------
            MemoryLayout.sequenceLayout(2, C_DOUBLE).withName("f1"),
            C_CHAR.withName("f2"),
            MemoryLayout.paddingLayout(7), <------------
            S8.withName("f3")
    ).withName("S9");

static final StructLayout S8 = MemoryLayout.structLayout(
            MemoryLayout.sequenceLayout(3, C_DOUBLE).withName("f0"),
            U7.withName("f1"),
            MemoryLayout.paddingLayout(6),
            C_POINTER.withName("f2"),
            C_POINTER.withName("f3")
    ).withName("S8");

 static final UnionLayout U7 = MemoryLayout.unionLayout(
            C_SHORT.withName("f0")
    ).withName("U7");

plus the corresponding native code at https://github.com/ibmruntimes/openj9-openjdk-jdk21/blob/9e4783d1f1520d34f5261f8e94ec5d22004ade08/test/jdk/java/foreign/nested/libNested.c#L44C1-L46C60

union U7{ short f0; };
struct S8{ double f0[3]; union U7 f1; void* f2; void* f3; };
struct S9{ char f0; double f1[2]; char f2; struct S8 f3; };

EXPORT struct S8 test_S8(struct S8 arg, struct S8(*cb)(struct S8)) { return cb(arg); }
EXPORT struct S9 test_S9(struct S9 arg, struct S9(*cb)(struct S9)) { return cb(arg); }

Comparing these against the exceptions captured in the description suggests a padding issue with double on AIX, since we modified the AIX code in the OpenJDK extension so that double is 4-byte aligned. So I wrote a simple test to verify the layout of the following struct:

typedef struct stru_char_double_char {  <---------- var1
        char elem1;
        double elem2[2];
        char elem3;
} stru_char_double_char;

int main()
{
     stru_char_double_char  var1 = {0x1, 22.333, 22.333, 0x2};
     printf("size of var1 = %d\n", sizeof(var1));
    return 0;
}

to debug the test as follows:

[1] stopped in main at line 25
   25       printf("size of var1 = %d\n", sizeof(var1));
(dbx) p var1
(elem1 = '^A', elem2 = (22.332999999999998, 22.332999999999998), elem3 = '^B')
(dbx) p &var2
0x2ff22b30
(dbx) p &var2.elem1
0x2ff22b30
(dbx) p &var2.elem2[0]
0x2ff22b34 <-------------------- the padding between `char` and `double` is 3 bytes
(dbx) p &var2.elem2[1]
0x2ff22b3c
(dbx) p &var2.elem3
0x2ff22b44
(dbx) p sizeof(var2)
24    <-------------------- the padding after the 2nd `char` is 3 bytes

which means the failing jtreg test struct S9 should be modified to use the correct padding bytes on AIX.
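
For reference, the offsets above can be reproduced with plain arithmetic. The following is only a minimal sketch: the helper names `align_up` and `layout_struct` are hypothetical and do not come from the JDK or OpenJ9 sources, and the alignments passed in are assumptions supplied by the caller. It lays out members in declaration order from each member's size and alignment.

```c
#include <stddef.h>

/* Round `offset` up to the next multiple of `align` (align must be > 0). */
size_t align_up(size_t offset, size_t align)
{
    return (offset + align - 1) / align * align;
}

/*
 * Lay out `count` members in declaration order.
 * sizes[i] and aligns[i] describe member i; the offset chosen for member i
 * is written to offsets[i]. Returns the struct size including tail padding.
 */
size_t layout_struct(const size_t *sizes, const size_t *aligns,
                     size_t count, size_t *offsets)
{
    size_t offset = 0;
    size_t max_align = 1;

    for (size_t i = 0; i < count; i++) {
        offset = align_up(offset, aligns[i]);   /* padding before member i */
        offsets[i] = offset;
        offset += sizes[i];
        if (aligns[i] > max_align) {
            max_align = aligns[i];
        }
    }
    return align_up(offset, max_align);         /* tail padding */
}
```

With sizes {1, 16, 1} for `char`, `double[2]`, `char`, an assumed alignment of 4 for the array gives offsets 0, 4, 20 and a size of 24, which agrees with the dbx output. An assumed alignment of 8 gives offsets 0, 8, 24 and a size of 32, which is the 7-byte padding the original S9 layout encodes.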

@ChengJin01
Contributor Author

To isolate the test cases, I modified the failing test struct as follows:

static final StructLayout S9 = MemoryLayout.structLayout(
            C_CHAR.withName("f0"),
            MemoryLayout.paddingLayout(3), <-----------
            MemoryLayout.sequenceLayout(2, C_DOUBLE).withName("f1"),
            C_CHAR.withName("f2"),
            MemoryLayout.paddingLayout(3), <------------
            S8.withName("f3")
    ).withName("S9");

and disabled the check on the returned value at https://github.com/ibmruntimes/openj9-openjdk-jdk21/blob/9e4783d1f1520d34f5261f8e94ec5d22004ade08/test/jdk/java/foreign/nested/TestNested.java#L73

    public void testNested(GroupLayout layout) throws Throwable {
        try (Arena arena = Arena.ofConfined()) {
...
            MemorySegment returned = (MemorySegment) downcallHandle.invokeExact(
                    (SegmentAllocator) arena, (MemorySegment) testValue.value(), stub);

            // testValue.check().accept(returnBox.get()[0]); <----- ignore the check on the returned value
            testValue.check().accept(returned);
        }
    }

plus the modified native code (which simply returns the passed-in struct rather than triggering the upcall thunk) at https://github.com/ibmruntimes/openj9-openjdk-jdk21/blob/9e4783d1f1520d34f5261f8e94ec5d22004ade08/test/jdk/java/foreign/nested/libNested.c#L79 so that only the downcall path is exercised.

EXPORT struct S9 test_S9(struct S9 arg, struct S9(*cb)(struct S9)) {
   // return cb(arg);  <------ disable the upcall thunk
   return arg;
}

which works well; the failing tests passed without any issue. In other words, there is no issue with downcall as long as the padding bytes are correct on AIX.

However, it crashed when invoking the upcall thunk (before reaching the upcall dispatcher) once the upcall was re-enabled in the native code:

EXPORT struct S9 test_S9(struct S9 arg, struct S9(*cb)(struct S9)) { 
     return cb(arg);  <------ enable the upcall thunk
   }

with the following stacktrace:

Signal 0 in genSystemCoreUsingGencore at line 203 in file ".../openj9-openjdk-jdk21/omr/port/aix/omrosdump_helpers.c" ($t27)
  203           rc = gencore(&coreDumpInfo);
where
(dbx) genSystemCoreUsingGencore(??, ??), line 203 in "omrosdump_helpers.c"
omrdump_create(??, ??, ??, ??), line 118 in "omrosdump.c"
doSystemDump(??, ??, ??), line 763 in "dmpagent.c"
protectedDumpFunction(??, ??), line 2897 in "dmpagent.c"
omrsig_protect(??, ??, ??, ??, ??, ??, ??), line 425 in "omrsignal.c"
runDumpAgent(??, ??, ??, ??, ??, ??), line 2875 in "dmpagent.c"
triggerDumpAgents(??, ??, ??, ??), line 1041 in "trigger.c"
generateDiagnosticFiles(??, ??), line 1162 in "gphandle.c"
omrsig_protect(??, ??, ??, ??, ??, ??, ??), line 425 in "omrsignal.c"
structuredSignalHandler(??, ??, ??, ??), line 837 in "gphandle.c"
mainSynchSignalHandler(??, ??, ??), line 1066 in "omrsignal.c"
test_S9(arg = (...), cb = 0xb765432122a8bae0), line 84 in "libNested.c" <----------
ffi_call_AIX(), line 110 in "aix.S"
ffi_call(??, ??, ??, ??), line 945 in "ffi_darwin.c"
ffiCallWithSetJmpForUpcall(??, ??, ??, ??, ??), line 73 in "UpcallExceptionHandler.cpp"
bytecodeLoopCompressed() at 0x90000000c69a238
c_cInterpreter(), line 48 in "pcinterp.s"
runJavaThread(??), line 682 in "callin.cpp"
javaProtectedThreadProc(J9PortLibrary*, void*)(??, ??), line 2104 in "vmthread.cpp"
omrsig_protect(??, ??, ??, ??, ??, ??, ??), line 425 in "omrsignal.c"
javaThreadProc(??), line 383 in "vmthread.cpp"
thread_wrapper(??), line 1733 in "omrthread.c"

Considering there is no issue on pLinux and other platforms in these tests, there should be no problem with the encoded native signature, as it is shared by all supported platforms. In that case it is most likely an issue with this type of struct in the upcall thunk on AIX.

@ChengJin01
Contributor Author

@zl-wang, could you please help take a look at what really happened to the thunk on AIX?

ChengJin01 pushed a commit to ChengJin01/openj9-openjdk-jdk21 that referenced this issue Oct 16, 2023
The changes reflect the actual padding bytes of
the failing test structs support double on AIX.

Related: #eclipse-openj9/openj9/issues/18287

Signed-off-by: ChengJin01 <jincheng@ca.ibm.com>
@zl-wang
Contributor

zl-wang commented Oct 16, 2023

all of your modifications still made the struct as if it were packed (i.e. members are not aligned right). is it intended to be that way? the correct padding should look like: char, 7-byte padding, two-double-array, char (with or without padding i think would be fine).

@zl-wang
Contributor

zl-wang commented Oct 16, 2023

i guessed the struct C declaration might look like:

   struct Foo {
         char  f0;
         double f1[2];
          char f2;
   }

Is my guess right? then, 7-byte padding should be expected. the size is 32-byte i think.

@ChengJin01
Contributor Author

ChengJin01 commented Oct 16, 2023

i guessed the struct C declaration might look like:

   struct Foo {
         char  f0;
         double f1[2];
          char f2;
   }

Is my guess right? then, 7-byte padding should be expected. the size is 32-byte i think.

Not really. It is 3 bytes in terms of the padding detected at #18287 (comment)

[1] stopped in main at line 25
   25       printf("size of var1 = %d\n", sizeof(var1));
(dbx) p var1
(elem1 = '^A', elem2 = (22.332999999999998, 22.332999999999998), elem3 = '^B')
(dbx) p &var2
0x2ff22b30
(dbx) p &var2.elem1
0x2ff22b30
(dbx) p &var2.elem2[0]
0x2ff22b34 <-------------------- the padding between `char` and `double` is 3 bytes
(dbx) p &var2.elem2[1]
0x2ff22b3c
(dbx) p &var2.elem3
0x2ff22b44
(dbx) p sizeof(var2)
24    <-------------------- the padding after the 2nd `char` is 3 bytes

@zl-wang
Contributor

zl-wang commented Oct 16, 2023

3-byte padding is a valid choice, but only for 32-bit mode (i.e. 32-bit executable). it is not right for 64-bit mode (our JVM mode ... we don't do 32bit build after java8).

@zl-wang
Contributor

zl-wang commented Oct 16, 2023

it is an easy test: xlC -q64 your-C-test.c to see if 3-byte or 7-byte padding is applied.
without -q64, the default is 32-bit mode compilation.

@ChengJin01
Contributor Author

ChengJin01 commented Oct 16, 2023

it is an easy test: xlC -q64 your-C-test.c to see if 3-byte or 7-byte padding is applied. without -q64, the default is 32-bit mode compilation.

Here's the result of the test as follows:

typedef struct stru_char_double_char {
        char elem1;
        double elem2[2];
        char elem3;
} stru_char_double_char;

int main()
{
stru_char_double_char var1;
    var1.elem1 = 0x1;
    var1.elem2[0] = 22.333;
    var1.elem2[1] = 22.333;
    var1.elem3 = 0x2;

    printf("size of var1 = %d\n", sizeof(var1));
   return 0;
}

-bash-5.0$ xlc  -q64 -g struct.c -o struct
-bash-5.0$ dbx struct
Type 'help' for help.
reading symbolic information ...
(dbx) stop at 26
[1] stop at 26
(dbx) run
[1] stopped in main at line 26
   26       printf("size of var1 = %d\n", sizeof(var1));
(dbx) p var1
(elem1 = '^A', elem2 = (22.332999999999998, 22.332999999999998), elem3 = '^B')
(dbx) p sizeof(var1)
24  <---------------------  4 + 8 * 2 + 4 = 24
(dbx) p &var1.elem1
0x0ffffffffffff9e0
(dbx) p &var1.elem2[0]
0x0ffffffffffff9e4    <------------------- 3 bytes
(dbx) p &var1.elem2[1]
0x0ffffffffffff9ec
(dbx) p &var1.elem3
0x0ffffffffffff9f4

in which I didn't notice any difference in padding compared to the 32-bit mode.
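
The same measurement can also be taken without a debugger. This is a minimal sketch assuming only standard C (`offsetof` from `<stddef.h>`); the function names `padding_before_elem2` and `print_layout` are made up for illustration. The values it reports depend on the compiler and ABI, so no particular output is claimed here.

```c
#include <stddef.h>
#include <stdio.h>

typedef struct stru_char_double_char {
    char elem1;
    double elem2[2];
    char elem3;
} stru_char_double_char;

/* Number of padding bytes the compiler placed between elem1 and elem2. */
size_t padding_before_elem2(void)
{
    return offsetof(stru_char_double_char, elem2) - sizeof(char);
}

/* Print the member offsets and total size chosen by the current compiler. */
void print_layout(void)
{
    printf("elem1 at %zu\n", offsetof(stru_char_double_char, elem1));
    printf("elem2 at %zu\n", offsetof(stru_char_double_char, elem2));
    printf("elem3 at %zu\n", offsetof(stru_char_double_char, elem3));
    printf("sizeof   = %zu\n", sizeof(stru_char_double_char));
    printf("padding before elem2 = %zu\n", padding_before_elem2());
}
```

Calling `print_layout()` from `main` in a build with and without `-q64` would show whether the two modes really agree, using `%zu` rather than `%d` for the `size_t` values.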

@zl-wang
Contributor

zl-wang commented Oct 16, 2023

that looks like a compiler bug. we can pursue this bug separately. in any case, what did you present to the thunk generation mechanism? it should be AGGREGATE_OTHER (if I remember correctly) with 24-byte in size, then it should work.
you will have 32-byte in size on other 64bit platforms i think, i.e. your signature encoding code is going to be different between AIX and others.

@tajila tajila added this to the Java 21 (0.42) milestone Oct 16, 2023
@ChengJin01
Contributor Author

it should be AGGREGATE_OTHER (if I remember correctly) with 24-byte in size, then it should work.
you will have 32-byte in size on other 64bit platforms

That's correct, as the struct size for upcall is computed ahead of time in the java code before it is passed to sigType->sizeInByte for the native signature in the native code.

your signature encoding code is going to be different between AIX and others.

If we change the code in https://github.com/eclipse-openj9/openj9/blob/master/runtime/vm/LayoutFFITypeHelpers.hpp for the native signature encoding on AIX, does it mean the upcall thunk should be updated accordingly to differentiate AIX from other platform in such case?

@zl-wang
Contributor

zl-wang commented Oct 16, 2023

If we change the code in https://github.com/eclipse-openj9/openj9/blob/master/runtime/vm/LayoutFFITypeHelpers.hpp for the native signature encoding on AIX, does it mean the upcall thunk should be updated accordingly to differentiate AIX from other platform in such case?

yes, different AGGREGATE_OTHER and size combination presented to the thunk gen ... different thunk code(s) will be generated to fit with the upcall. i hoped the signature coming in on AIX already has the correct padding identified (at least for the time being as it is).

@ChengJin01
Contributor Author

yes, different AGGREGATE_OTHER and size combination presented to the thunk gen ... different thunk code(s) will be generated to fit with the upcall...

If so, this is something we should coordinate to ensure the new type (e.g. J9_FFI_UPCALL_SIG_TYPE_STRUCT_AGGREGATE_OTHER_C_D (char, double) or something else more preferable) is only intended for AIX to minimize the changes on other platforms.

@zl-wang
Contributor

zl-wang commented Oct 16, 2023

If so, this is something we should coordinate to ensure the new type (.e.g. J9_FFI_UPCALL_SIG_TYPE_STRUCT_AGGREGATE_OTHER_C_D (char, double) or something else more preferable) is only intended for AIX to minimize the changes on other platforms.

that sounds unsustainable. you cannot cover all these unlimited possibilities ... C_D, C_C_D, etc. i remembered you told us there is a signature with correct padding available to you.

@ChengJin01
Contributor Author

ChengJin01 commented Oct 16, 2023

i remembered you told us there is signature with correct padding available to you.

You mean the existing definition we already have now in the code? but I can't determine which one defined from

#define J9_FFI_UPCALL_SIG_TYPE_STRUCT_AGGREGATE_ALL_SP 0x1A /* Intended for structs with all floats */
is suitable in this case.

@zl-wang
Contributor

zl-wang commented Oct 16, 2023

AGGREGATE_OTHER is the suitable/correct one to encode to ... of course need the right size (24 on AIX and 32 on others).

@ChengJin01
Contributor Author

AGGREGATE_OTHER is the suitable/correct one to encode to ... of course need the right size (24 on AIX and 32 on others).

Then there is no need to add a new definition given the correct struct size is already set to sigType->sizeInByte for the upcall thunk.

@ChengJin01
Contributor Author

With the following struct in the test on AIX:

{
         char  f0;
         double f1[2];
          char f2;
}

the debug output from the code looks correct in terms of the sigArray input. The question now is why it crashes there.

encodeUpcallSignature: sigType->sizeInByte = 24, cSignature = #[C(3)2:DC(3)]
encodeUpcallSignature: sigType->type = 58  <----- J9_FFI_UPCALL_SIG_TYPE_STRUCT_AGGREGATE_OTHER 0x3A

@keithc-ca
Contributor

that looks like a compiler bug. we can pursue this bug separately

If a bug, it's not clear how it could be fixed without breaking compatibility with existing binaries.

@zl-wang
Contributor

zl-wang commented Oct 17, 2023

If a bug, it's not clear how it could be fixed without breaking compatibility with existing binaries.

it is a bug from expectation perspective (aligned the same as other 64bit platforms). there is no compatibility issue here (i.e. it is consistent always within AIX).

@keithc-ca
Contributor

If the first failure relates to this:

    static final StructLayout S9 = MemoryLayout.structLayout(
            C_CHAR.withName("f0"),
            MemoryLayout.paddingLayout(7),
            MemoryLayout.sequenceLayout(2, C_DOUBLE).withName("f1"),
            C_CHAR.withName("f2"),
            MemoryLayout.paddingLayout(7),
            S8.withName("f3")
    ).withName("S9");

then I believe the test is just broken in that it assumes 7 bytes of padding between f0 and f1.

Alternatively, the checking in AbstractLinker should not be ignoring padding members.

@keithc-ca
Contributor

it is a bug from expectation perspective

I had the same expectation, but I don't see that we have any alternative to working with what the compiler does (and, I expect, must continue to do). Calling it a "bug" suggests it might someday be fixed, but, as I said, I don't think this can be "fixed".

@zl-wang
Contributor

zl-wang commented Oct 22, 2023

@ChengJin01 David (@edelsohn ) provided a few informative libffi links as below:
python/cpython#82809 (known libffi problems with nested struct)
https://github.com/libffi/libffi/blob/4661ba7928b49588aec9e6976673208c8cbf0295/doc/libffi.texi#L505 (no direct support for arrays or unions. However, they can be emulated using structures.)

To verify if this PR is the same, you can test out this below struct:
struct {char a; double b1, b2; char; void *p;}

It is a simplified emulation of this PR's problematic (for libffi) struct. Exactly the same size, padding, and memory layout ...

@ChengJin01
Contributor Author

https://github.com/libffi/libffi/blob/4661ba7928b49588aec9e6976673208c8cbf0295/doc/libffi.texi#L505 (no direct support for arrays or unions. However, they can be emulated using structures.)

That's how we implemented the union in FFI at #18291

To verify if this PR is the same, you can test out this below struct:
struct {char a; double b1, b2; char; void *p;}

This is something I already verified previously by modifying the failing test, which crashed in the same way with struct {char a; double b[2]; char; void *p;}. But I can double-check to confirm that.

@zl-wang
Contributor

zl-wang commented Oct 23, 2023

i didn't notice the ending *p. I intended to say testing out this one:
struct {char a; double b1, b2; char c;}
this is exactly the same as your original failing case, in terms of size, padding, and layout. just doesn't have array and nested struct.

@ChengJin01
Contributor Author

i didn't notice the ending *p. I intended to say testing out this one:
struct {char a; double b1, b2; char c;}
this is exactly the same as your original failing case, in terms of size, padding, and layout. just doesn't have array and nested struct.

After changing the java code of struct S9 to

    static final StructLayout S9 = MemoryLayout.structLayout(
            C_CHAR.withName("f0"),
            MemoryLayout.paddingLayout(3),
            C_DOUBLE.withName("f1"),
            C_DOUBLE.withName("f2"),
            C_CHAR.withName("f3"),
            MemoryLayout.paddingLayout(3)
    ).withName("S9");

plus the native code as follows:

struct S9{ char f0; double f1, f2;  char f3; };

the failing test for S9 passed without any issue. In other words, the crash occurred only when there is a nested double array in the struct.

@edelsohn

As @zl-wang quoted the libffi documentation link that I sent him, arrays are not first-class objects in libffi. Arrays and nested arrays do not function in libffi reliably.

Can the Java code pretend that the pair of doubles are non-nested members of the struct and claim victory?

@ChengJin01
Contributor Author

ChengJin01 commented Oct 23, 2023

Can the Java code pretend that the pair of doubles are non-nested members of the struct and claim victory?

Theoretically we could do that by changing the array to primitives in the java code, but there is no easy way to determine in which scenarios a nested array should be replaced with plain primitives, given that struct S9 { char f0; double f1[2]; char f2; } is the only exception we have detected so far (other nested arrays work as expected).

@zl-wang
Contributor

zl-wang commented Oct 23, 2023

from the tests we have done (which case succeeded and which failed) and given that @edelsohn described arrays and nested arrays as unreliable for libffi, I think we can claim victory from java perspective for the time being.

@edelsohn is this unreliability happening on other platforms too? plus, the main change (for the specific struct) is in the C code where you fill in the ffi_type struct in order to use libffi, where you can emulate arrays as a sequence of scalars when providing the struct definition to libffi. the java-side changes are only relevant to calculating field offsets within the C struct so that java code can access them in the corresponding/reflective MemorySegment.

@edelsohn

If one looks at the Python and libffi file history, it has been tweaked for multiple platforms, but fewer and fewer have the type of ABI quirks that AIX has.

Because the inlined double members behave as expected and the array doesn't, this probably is related to libffi emulating the array as a nested struct, and a struct with the first member a double has a stricter alignment and padding than an interior double. The "as if" behavior of libffi is not correct in this case and the member position doesn't agree.
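
To make that hypothesis concrete, here is a toy model in C. It is not libffi's code and does not try to reproduce libffi's computation; the type `member`, the function names, and the two alignment constants are assumptions introduced only for illustration (4 bytes for an interior `double` array, 8 bytes for a struct whose first member is a `double`).

```c
#include <stddef.h>

/* A member described only by its size and the alignment it demands. */
typedef struct member {
    size_t size;
    size_t align;
} member;

/* Assumed alignments under the AIX rule discussed in this thread. */
enum {
    INTERIOR_DOUBLE_ALIGN = 4,       /* double or double[] that is not first */
    LEADING_DOUBLE_STRUCT_ALIGN = 8  /* struct whose first member is a double */
};

/* Round `value` up to the next multiple of `align` (align must be > 0). */
size_t round_up(size_t value, size_t align)
{
    return (value + align - 1) / align * align;
}

/* Offset of members[index] when the members are laid out in order. */
size_t member_offset(const member *members, size_t index)
{
    size_t offset = 0;

    for (size_t i = 0; i <= index; i++) {
        offset = round_up(offset, members[i].align);
        if (i < index) {
            offset += members[i].size;  /* skip past the earlier member */
        }
    }
    return offset;
}
```

Placing a 16-byte member after a single `char`, the model puts it at offset 4 when it carries `INTERIOR_DOUBLE_ALIGN` and at offset 8 when it carries `LEADING_DOUBLE_STRUCT_ALIGN`. If the array is described as a nested struct, the two descriptions therefore disagree about where the doubles live.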

@daltenty

daltenty commented Oct 24, 2023

Because the inlined double members behave as expected and the array doesn't, this probably is related to libffi emulating the array as a nested struct, and a struct with the first member a double has a stricter alignment and padding than an interior double. The "as if" behavior of libffi is not correct in this case and the member position doesn't agree.

It's not clear that case is broken either; I tried this case in a libffi test. We perform a call which takes the struct by value, increments the members and returns the struct, and everything comes back fine.

The struct:

typedef struct
{
  unsigned char c1;
  double s[2];
  unsigned char c2;
} test_structure_12;

How Clang computes layout on AIX:

*** Dumping AST Record Layout
         0 | test_structure_12
         0 |   unsigned char c1
         4 |   double[2] s
        20 |   unsigned char c2
           | [sizeof=24, align=4, preferredalign=4]

The libffi type description:

  ts12a_type.type = FFI_TYPE_STRUCT;
  ts12a_type.elements = ts12a_type_elements;
  ts12a_type_elements[0] = &ffi_type_double;
  ts12a_type_elements[1] = &ffi_type_double;
  ts12a_type_elements[2] = NULL;

  ts12_type.type = FFI_TYPE_STRUCT;
  ts12_type.elements = ts12_type_elements;
  ts12_type_elements[0] = &ffi_type_uchar;
  ts12_type_elements[1] = &ts12a_type;
  ts12_type_elements[2] = &ffi_type_uchar;
  ts12_type_elements[3] = NULL;      

The resulting size and alignment from libffi:

size: 28
align: 4

so everything seems to agree, at least for this case.

I think to clarify the crashing case, we'd need to be clear on what the ffitype description java passes to fficall looks like. That would help clearly identify if the libffi_call is faulty.

(Another suggestion that came to mind while reading #18287 (comment) where the upcall was disabled, would be to try just a downcall with a test function on the C side that will modify the struct and then have the java side inspect the return, similar to what we had in the stand alone ffi test. That might also help clarify if the ffi_call is going wrong)

@ChengJin01
Contributor Author

ChengJin01 commented Oct 24, 2023

I think to clarify the crashing case, we'd need to be clear on what the ffitype description java passes to fficall looks like. That would help clearly identify if the libffi_call is faulty.

The code in there (java & native) correctly generates the simplified string like [C2:DC] (where C stands for char while 2:D means an array of 2 doubles), which is used in creating the specified ffitypes on all supported platforms (otherwise everything would be messed up in there).

Your case is wrong as it is verifying struct test_structure_12 { char f0; double f1[2]; char f2;} with a function like func(struct test_structure_12 a) rather than func(struct a, void *p). So you should try with struct { char f0; double f1[2]; char f2; void *p; } by adding one more pointer member at the end of the struct to see how it goes. Or try with a struct and a pointer as two separate parameters for a function like func(struct a, void *p) to see what happens to the pointer parameter (which is what happens in the crash due to the messed-up pointer (nothing to do with upcall)).
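
The native side of such a reproducer could look roughly like the following. This is only a sketch: the struct and function names are hypothetical and are not taken from libNested.c or the libffi test suite. Called directly from C these functions behave trivially; the interesting case is invoking them through ffi_call (or a downcall handle) and checking whether the pointer survives.

```c
#include <stddef.h>

/* The struct under discussion: char, double[2], char. */
struct S {
    char f0;
    double f1[2];
    char f2;
};

/* Variant 1: struct by value followed by a separate pointer parameter.
 * Returns the pointer so the caller can detect corruption. */
void *echo_pointer_after_struct(struct S arg, void *p)
{
    (void)arg;  /* only the position of p after the struct matters here */
    return p;
}

/* Variant 2: the pointer folded into the struct as a trailing member. */
struct S_with_ptr {
    char f0;
    double f1[2];
    char f2;
    void *p;
};

/* Returns the embedded pointer from the by-value struct. */
void *echo_embedded_pointer(struct S_with_ptr arg)
{
    return arg.p;
}
```

If the size reported to the call machinery for the struct is larger than what the compiler uses, the argument that follows the struct would be read from the wrong place, and the returned pointer would not match the one passed in.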

Based on what we observed in the dump, the elements f0, f1, f2 were correct in terms of the passed-in values; only the pointer p was corrupted.

@ChengJin01
Contributor Author

ChengJin01 commented Oct 24, 2023

@daltenty,

ts12_type_elements[0] = &ffi_type_uchar;

Maybe it doesn't matter but we are using ffi_type_sint8 for char (1 byte) in our code rather than ffi_type_uchar (which is literally ffi_type_uint8). The reason for this is that we need to pass byte in java to native (which is char) for ffi_call.

@zl-wang
Contributor

zl-wang commented Oct 24, 2023

@daltenty

size: 28    <<< typo? 24 is expected
align: 4

@daltenty

@daltenty

size: 28    <<< typo? 24 is expected
align: 4

Ah point taken, I've clearly misread that. Yes, while the call works libffi and the compiler do disagree about the size, and when I add the extra parameter I get the crash you are seeing.

@ChengJin01
Contributor Author

Hi @daltenty, is there any update on your side?

@daltenty

Hi @daltenty, is there any update on your side?

I've posted a PR to libffi to hopefully correct the layout computation: libffi/libffi#805

I think you can try applying that patch to the openj9 local copy of libffi to test and see if it resolves the issue we're seeing here.

@ChengJin01
Contributor Author

Hi @daltenty, is there any update on your side?

I've posted a PR to libffi to hopefully correct the layout computation: libffi/libffi#805

I think you can try applying that patch to the openj9 local copy of libffi to test and see if it resolves the issue we're seeing here.

Many thanks. I will try this patch in our code to see how it goes.

@ChengJin01
Contributor Author

I've verified the patch offered by @daltenty at libffi/libffi#805 with a recompiled build, which works as expected; all test cases in the test suite passed:

===============================================
test/jdk/java/foreign/nested/TestNested.java
Total tests run: 32, Passes: 32, Failures: 0, Skips: 0
===============================================

ChengJin01 added a commit to ChengJin01/openj9 that referenced this issue Nov 13, 2023
The change simply adopts the fix at libffi/libffi#805
to resolve the issue with the nested struct in libffi on AIX.

Related: eclipse-openj9#18287

Signed-off-by: ChengJin01 <jincheng@ca.ibm.com>
@pshipton
Member

Now that #18375 has been merged, is there an excluded set of tests that need to be enabled?

@keithc-ca
Contributor

TestNested seems to be disabled for all platforms for jdk21+:

@ChengJin01
Contributor Author

TestNested seems to be disabled for all platforms for jdk21+:

* [Exclude the FFI test suites added in JDK21 adoptium/aqa-tests#4647](https://github.com/adoptium/aqa-tests/pull/4647)

* [Add openjdk/excludes/ProblemList_openjdk22 files adoptium/aqa-tests#4694](https://github.com/adoptium/aqa-tests/pull/4694)

We will need to double-check the test suite again before enabling it on all supported platforms.

@pshipton
Member

This is waiting for libffi/libffi#805 to merge so we can update to the approved content.

See #18375 (comment)

@pshipton
Member

libffi/libffi#805 has merged without further changes. Closing this.
