Internal test failures in MathLoadTest_autosimd #14755
Comments
I looked at this a bit more, since this still happens fairly regularly in the nightly tests.
I'll see if I can narrow down the failure in Grinder.
The failure appears with
I managed to get a failure (out of ~350 attempts) with the following limit file:
I tried to refine that list by running the test over the weekend, randomly permuting that limit file before each run and limiting to the first half of it. I got no failures in ~3000 runs. Given where the failure occurs, I'm assuming that it's
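For anyone who wants to repeat the "permute and keep the first half" refinement described above, here is a minimal sketch of that step. It treats each limit-file line as an opaque string and makes no assumption about the limit-file syntax. The function name `shuffledFirstHalf` and the explicit seed parameter are hypothetical, chosen only for illustration; the original runs may well have used a shell script instead.

```cpp
#include <algorithm>
#include <random>
#include <string>
#include <vector>

// Shuffle the limit-file entries with the given seed and keep the first half.
// The size is rounded up so that a one-entry list is never reduced to nothing.
// (Hypothetical helper for illustration; not part of OpenJ9 or the test tooling.)
std::vector<std::string> shuffledFirstHalf(std::vector<std::string> entries, unsigned seed)
{
    std::mt19937 rng(seed);
    std::shuffle(entries.begin(), entries.end(), rng);
    entries.resize((entries.size() + 1) / 2);
    return entries;
}
```

Logging the seed alongside each run would make it possible to regenerate the exact subset that was active if one of the runs does fail.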
I tried getting compilation logs for So, I tried getting logs for
If I understand correctly, the server encountered an inaccessible address while attempting to write the compilation log for the method. Then, it tried dumping the IL of the failing method and encountered a mismatch in the size of the known object table. Maybe that second crash is expected? I'm not sure if the server can become out-of-sync with the client if a crash occurs on a server compilation thread. Other crashes have not had the secondary crash on the diagnostic thread. Otherwise there could be a problem with the known object table.
I looked into the server crash a bit more. It was crashing after printing a node introduced in the Since it's a little long, I'll just attach the compilation log that I got: tracelogInteger.1722517.server.140.zip What I think is happening is:
After this, the log is slightly corrupted. The
to this vsub line:
The I'm not as familiar with how autosimd and the node/tree creation and cloning going on here are supposed to work. Marius mentioned to me that @hzongaro might be able to help, or suggest someone who might be able to help, which would be appreciated.
I said in #14755 (comment) that block 143 looked like the Also, I looked at the server core (the one dumped during the crash while writing that log file) and the node that the server was attempting to print had an However, immediately after the log file prints the opcode name, it then tries to print
and we must have gone down the true branch of that conditional expression. But for me, according to
and the inaccessible address that the log writer tried to access is exactly The vector name initialization routine is here in In my core, the values of the constants are:
which does seem to check out. When I print out the entries of
It looks like the documentation says that
Could there be some strange race condition where two threads simultaneously try to
In addition to the possible unsolved race condition, the
I'm remembering that
Or there's something weird going on with translation units.
I now know what's actually going wrong with the server logs. The initialization function for the vector names is here, like I said above: Notice that the memory for the names is allocated with openj9/runtime/compiler/env/J9CompilerEnv.cpp Lines 97 to 112 in 70f99fd
If the JITServer is in a compilation thread and per-client memory is enabled, it will use that per-client memory, and otherwise use the global persistent memory. But the vector names are initialized on-demand within a compilation thread! So we're allocating the vector names using one client's memory, which gets freed when that client disconnects. That's how the vector name pointers in that array can suddenly point to inaccessible memory.
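To make the lifetime problem concrete, here is a small self-contained model of it. This is only a sketch under stated assumptions: `Region` and `NameTable` are hypothetical stand-ins, not OpenJ9 types, and the names stored are placeholders. The model tracks which region owns the lazily initialized names and whether that region is still live, so it can report the dangling state without ever dereferencing freed memory.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a persistent-memory region (per-client or global).
struct Region
{
    bool live = true;  // becomes false once the region has been freed
    std::vector<std::unique_ptr<std::string>> blocks;

    const std::string *allocate(const std::string &s)
    {
        blocks.push_back(std::make_unique<std::string>(s));
        return blocks.back().get();
    }

    // Models a client disconnecting: everything allocated here is freed.
    void release()
    {
        blocks.clear();
        live = false;
    }
};

// Hypothetical process-wide table that is filled in on first use,
// analogous to the on-demand vector name initialization described above.
struct NameTable
{
    const Region *owner = nullptr;
    std::vector<const std::string *> names;

    void initOnce(Region &r)
    {
        if (owner != nullptr)
            return;  // already initialized by an earlier compilation
        owner = &r;
        names.push_back(r.allocate("vsub"));  // placeholder names
        names.push_back(r.allocate("vadd"));
    }

    // True when the table outlived the region its names were allocated from.
    bool dangling() const { return owner != nullptr && !owner->live; }
};
```

If `initOnce` is first called with a per-client region and that region is later released, `dangling()` reports true, which corresponds to the inaccessible pointers seen by the log writer. Initializing from a region that lives as long as the process avoids this in the model; whether the actual fix takes that shape is not stated in this comment.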
While I was testing this fix, I did another test run and confirmed that the test error still occurs with
Observed in the JITAAS pipeline tests:
MathLoadTest_autosimd_CS_5m_0
job/Test_openjdk8_j9_sanity.system_x86-64_linux_jit_Personal/292/
MathLoadTest_autosimd_special_5m_29
job/Test_openjdk17_j9_special.system_x86-64_linux_jit_Personal/213/
job/Test_openjdk8_j9_special.system_x86-64_linux_jit_Personal/291/
job/Test_openjdk17_j9_special.system_x86-64_linux_jit_Personal/217/
Console output for MathLoadTest_autosimd_CS_5m_0
I haven't been able to reproduce MathLoadTest_autosimd_special_5m_29 in Grinder, but MathLoadTest_autosimd_CS_5m_0 was reproduced with 1/80 failures in job/Grinder/21725/ with JITAAS and 0/200 failures in job/Grinder/21737/ without JITAAS.