Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only the diagnostic threads should be executing JitDump compilations #11772

Closed
fjeremic opened this issue Jan 25, 2021 · 4 comments · Fixed by #11825
Closed

Only the diagnostic threads should be executing JitDump compilations #11772

fjeremic opened this issue Jan 25, 2021 · 4 comments · Fixed by #11825
Labels

Comments

@fjeremic
Copy link
Contributor

The original design behind the diagnostic compilation thread was to have a by default suspended compilation thread which will only get activated when the JitDump process is triggered. The idea is that the "normal" compilation threads may not be in the correct state to start compiling, because we have just encountered an error. The diagnostic compilation thread has just been sitting and waiting for someone to activate it and is fully capable of triggering a compilation.

Unfortunately as we will describe below via logging, it seem that "normal" compilation threads are picking up JitDump compiles. This is something we do not want, because there are a number of things that we do for JitDump compilations which are special (among others which I am likely missing):

  • We install a custom signal handler to ensure recursive crashes during the JitDump process are handled well
  • We enable tracing based off of whether we are a diagnostic thread or not

This can be observed via a simple java -version command:

fjeremic@Filips-MacBook-Pro openj9-openjdk-jdk11 % ./build/macosx-x86_64-normal-server-release/jdk/bin/java '-Xjit:{sun/nio/fs/UnixPath.checkNotNul(Ljava/lang/String;C)V}(crashduringcompile),verbose={*},vlog=vlog' -Xdump:none -Xdump:jit -version
Assertion failed at /Users/fjeremic/Projects/openj9-openjdk-jdk11/omr/compiler/compile/OMRCompilation.cpp:1036: !self()->getOption(TR_CrashDuringCompilation)
VMState: 0x00050080
	crashDuringCompile option is set
compiling sun/nio/fs/UnixPath.checkNotNul(Ljava/lang/String;C)V at level: warm

Unhandled exception
Type=Segmentation error vmState=0x00050080
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
Handler1=000000000E4CF750 Handler2=000000000E764AC0 InaccessibleAddress=0000000000000000
RDI=000000000FC99C38 RSI=000000000000A603 RAX=00007FFAFDD0FAA0 RBX=0000000045801000
RCX=000000000FB8A35F RDX=000000000000A603 R8=0000000000000013 R9=0000000000000013
R10=0000000000000013 R11=00007FFEE183C9B0 R12=0000000045801000 R13=0000000000000003
R14=00007FFAF1299B20 R15=000000000000019A
RIP=000000000EC3E484 GS=0000 FS=0000 RSP=0000700002667ED0
RFlags=0000000000010206 CS=002B RBP=0000700002668D80 ERR=0000000000000006
TRAPNO=000000060000000E CPU=0000000000060000 FAULTVADDR=0000000000000000
XMM0 0000ffff00000000 (f: 0.000000, d: 1.390650e-309)
XMM1 0000000000000000 (f: 0.000000, d: 0.000000e+00)
XMM2 000000003f7f91ac (f: 1065324992.000000, d: 5.263405e-315)
XMM3 0000000000000000 (f: 0.000000, d: 0.000000e+00)
XMM4 0000000000000000 (f: 0.000000, d: 0.000000e+00)
XMM5 0000000000000000 (f: 0.000000, d: 0.000000e+00)
XMM6 0000000000000000 (f: 0.000000, d: 0.000000e+00)
XMM7 0000000000000000 (f: 0.000000, d: 0.000000e+00)
XMM8 0f0f0f0f0f0f0f0f (f: 252645136.000000, d: 3.815737e-236)
XMM9 0302020102010100 (f: 33620224.000000, d: 3.524484e-294)
XMM10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
XMM11 0000000000000002 (f: 2.000000, d: 9.881313e-324)
XMM12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
XMM13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
XMM14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
XMM15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
Module=/Users/fjeremic/Projects/openj9-openjdk-jdk11/build/macosx-x86_64-normal-server-release/jdk/lib/compressedrefs/libj9jit29.dylib
Module_base_address=000000000E8DE000 Symbol=_ZN2TR4trapEv
Symbol_address=000000000EC3E3F0

Method_being_compiled=sun/nio/fs/UnixPath.checkNotNul(Ljava/lang/String;C)V
Target=2_90_20210122_000000 (Mac OS X 10.16)
CPU=amd64 (16 logical CPUs) (0x800000000 RAM)

----------- Stack Backtrace -----------
---------------------------------------
JVMDUMP039I Processing dump event "gpf", detail "" at 2021/01/25 13:42:20 - please wait.
JVMDUMP032I JVM requested JIT dump using '/Users/fjeremic/Projects/openj9-openjdk-jdk11/jitdump.20210125.134220.69229.0001.dmp' in response to an event
Assertion failed at /Users/fjeremic/Projects/openj9-openjdk-jdk11/omr/compiler/compile/OMRCompilation.cpp:1036: !self()->getOption(TR_CrashDuringCompilation)
VMState: 0x00050080
	crashDuringCompile option is set
compiling sun/nio/fs/UnixPath.checkNotNul(Ljava/lang/String;C)V at level: warm

JVMDUMP010I JIT dump written to /Users/fjeremic/Projects/openj9-openjdk-jdk11/jitdump.20210125.134220.69229.0001.dmp
JVMDUMP013I Processed dump event "gpf", detail "".

Note that "crashDuringCompile option is set" is seen twice. The first time is when we trigger the initial compilation, and the second time is when we are doing the JitDump compilation. I'm attaching both the jidump trace log and the verbose log to this issue for future reference. Let's dissect the verbose log to see what happen inside.

jitdump-verbose-log.zip

@fjeremic
Copy link
Contributor Author

fjeremic commented Jan 25, 2021

The first thing we will notice is that in the jidump the <logRecompilation> block is empty modulo this one message (which I'm still trying to understand how and why it is printed; any help is appreciated):

<logRecompilation>
[ 25434] O^O COLD BLOCK MARKER: coldBlockMarker marked block_3 cold
            (Building alias info)
</logRecompilation>
</jitDump>

The reason for this will become clear once we examine the verbose log. Opening that up we see the following:

 (warm) Compiling sun/nio/fs/UnixPath.checkNotNul(Ljava/lang/String;C)V  OrdinaryMethod j9m=000000001C0F9038 t=410 compThreadID=3 memLimit=262144 KB freePhysicalMemory=11927 MB
-precompileMethod sun/nio/fs/UnixPath.checkNotNul(Ljava/lang/String;C)V
#DISPATCH: Compiling sun/nio/fs/UnixPath.checkNotNul(Ljava/lang/String;C)V @ warm
#JITDUMP:  Filip:  in addMethodToBeCompiled getNumTotalCompilationThreads = 8
#JITDUMP:  Filip:  location #1
#RA:  Adding ClassRedefinitionPIC assumption: TR_RedefinedClassPicSite@00007FFAE0085C40: key=000000001C065EC0 picLocation=0000000030E02972 size=8
#JITDUMP:  Filip:  location #3
#JITDUMP:  Filip: about to crash

compThreadID=3 picked up the original compilation for our compile and of course we hit the crashduringcompile code and we end up triggering an assert which causes the crash:
https://github.com/eclipse/omr/blob/b38b0b77c9972bc0b8d01ed66665270a0f108845/compiler/compile/OMRCompilation.cpp#L1031-L1032

The other compilation threads are still processing the queue, and the application threads are still running. This is why we see lots of stuff printed following the crash point of the compilation on compThreadID=3. A bit later we see the crashed thread has reached the JitDump process:

#JITDUMP:  JIT dump initiated. Crashed vmThread=000000001C01EE00
...
#JITDUMP:  Filip: Jitdump: num comp threads active = 4
#JITDUMP:  Notified all waiting threads
#JITDUMP:  Disabled further compilation
#JITDUMP:  Filip: Jitdump: num comp threads active = 4
#JITDUMP:  Filip: Jitdump: num comp threads active after stopping = 4
#JITDUMP:  Diagnostic compilation thread available - purging compilation queue
#JITDUMP:  Filip: Jitdump: num comp threads active = 4
#JITDUMP:  crashed in compilation thread
#JITDUMP:  Found compilation object

The JitDump logic in JitDump.cpp sets a flag to disable all future compilations and it purges the entire compilation queue. However as we can note by the above printouts (inserted in code by me) the number of active threads does not decrease once we set the flag, or purge the compilation queue. This is very important. We rely on the normal compilation thread suspension logic to suspend the compilation threads when the following conditions are met:

  • Compilations are suspended, so nothing else is getting added to the queue
  • The queue is empty so there is nothing to compile

A bit later the JitDump logic resumes the diagnostic thread (compThreadID=7):

#JITDUMP:  Resuming DiagCompThread
...
#INFO:  t=   410 Resume request for compThread 7
#JITDUMP:  recompiling a method for log: 000000001C0F9038
#DISPATCH: Successfully created compiled body [0000000030E02CD8-0000000030E02DC2] for java/util/HashMap.tableSizeFor(I)I @ warm
#JITDUMP:  compileMethod() about to issued synchronously
#INFO:  t=   410 Suspending compThread 7 due to empty queue Qweight=30 active=3 overallCompCpuUtil=0

However the logging tells us that the diagnostic thread is resumed, and then subsequently it started processing the queue, and encountered that the queue is empty and it forces itself to get suspended. This occurs because the diagnostic thread is enabled in between purging the compilation queue and queuing the JitDump compilation:
https://github.com/eclipse/openj9/blob/6485d7550c994521c5ba78744d5530b78986d9a9/runtime/compiler/control/JitDump.cpp#L658-L686

There is actually a TODO there which talks exactly about this which was made likely when the original JitDump design was put together, but never fixed.

Remember, during all of this the application is still executing and the VM is still suggesting to the JIT compiler that we should compile methods, as their count goes down to zero. This is why we see compile requests coming in, but they're not satisfied because we have suspended further compilations:

#CR:  000000001C00FF00   Compile request OrdinaryMethod j9method=000000001C13B810 jdk/internal/org/objectweb/asm/ByteVector.putUTF8(Ljava/lang/String;)Ljdk/internal/org/objectweb/asm/ByteVector; optLevel=2
#JITDUMP:  Filip:  location #9

#JITDUMP:  Filip:  DLT stuff
#JITDUMP:  Filip:  Doing the compile
#JITDUMP:  Filip:  About to wait for compilation to finish
#FAILURE:  t=410 <WARNING: JIT Compilations are suspended>
...
 (warm) Compiling sun/nio/fs/UnixPath.checkNotNul(Ljava/lang/String;C)V  DumpMethod j9m=000000001C0F9038 t=410 compThreadID=1 memLimit=262144 KB freePhysicalMemory=11927 MB

Eventually though we do reach a point where compThreadID=1 has picked up our DumpMethod (JitDump) compilation. Note however that compThreadID=7 was the diagnostic thread. This means that a normal compilation thread has picked up a JitDump compile. This is problematic because the following code will not get executed:
https://github.com/eclipse/openj9/blob/6485d7550c994521c5ba78744d5530b78986d9a9/runtime/compiler/control/CompilationThread.cpp#L8147-L8171

i.e. the options to enable tracing are not activated if a normal compilation thread picks up a JitDump compile. This is why our <logRecompilation> is empty!

@fjeremic
Copy link
Contributor Author

fjeremic commented Jan 25, 2021

In my mind the solution to this problem seems to be the following:

  • Change the logic for setting options for JitDump compiles to check for the method details, i.e. query the isJitDumpMethod() API, rather than query whether the thread is a diagnostic thread
    • This will ensure <logRecompilation> always provides tracing
  • Change the compilation queue processing logic so that the diagnostic thread is the only thread who can pick up JitDump compilation requests
  • Change the JitDump logic so the following occurs sequentially:
    1. Compilations are suspended
    2. Compilation queue is purged
    3. An interruption request is sent to all compilation threads to terminate their current compilation at the next yield point
    4. All JitDump compilation requests are added to the queue
    5. The diagnostic thread is enabled

@dsouzai what are your thoughts given the above analysis?

Edit:

Thinking about this more the following logic may not work:

  • All JitDump compilation requests are added to the queue
  • The diagnostic thread is enabled

because the JitDump compilation requests are synchronous, so we need to enable the diagnostic thread before the synchronous compile request. The better solution here would be to swap the order, but teach the self-suspension logic not to suspend the diagnostic thread at all. The JitDump process will just suspend the diagnostic thread before finishing.

@fjeremic
Copy link
Contributor Author

I also explored the idea of attempting to suspend all "normal" compilation threads right in the JitDump logic, and waiting for that to happen. I used the following code:

   compInfo->acquireCompMonitor(crashedThread);

   // NOTE: Comp monitor is held while this is happening
   for (uint8_t i = 0; i < compInfo->getNumTotalCompilationThreads(); i++)
      {
      TR::CompilationInfoPerThread *curCompThreadInfoPT = compInfo->getArrayOfCompilationInfoPerThread()[i];
      TR_ASSERT(curCompThreadInfoPT, "a thread's compinfo is missing\n");

      // Set the flag for the compilation-in-progress to be aborted at the next yield point unless the thread is
      // a diagnostic thread processing a jitdump operation, in which case we want to let it finish. Note that the
      // below switch statement will still request that even the diagnostic threads be terminated, but they will
      // do so once returning to their state processing loop.
      if (!curCompThreadInfoPT->isDiagnosticThread())
         curCompThreadInfoPT->setCompilationShouldBeInterrupted(SHUTDOWN_COMP_INTERRUPT);
      else
         {
         continue;
         }

      switch (curCompThreadInfoPT->getCompilationThreadState())
         {
         case COMPTHREAD_SUSPENDED:
            curCompThreadInfoPT->setCompilationThreadState(COMPTHREAD_SIGNAL_TERMINATE);
            curCompThreadInfoPT->getCompThreadMonitor()->enter();
            curCompThreadInfoPT->getCompThreadMonitor()->notifyAll();
            curCompThreadInfoPT->getCompThreadMonitor()->exit();
            break;

         case COMPTHREAD_ACTIVE:
         case COMPTHREAD_SIGNAL_WAIT:
         case COMPTHREAD_WAITING:
            curCompThreadInfoPT->setCompilationThreadState(COMPTHREAD_SIGNAL_TERMINATE);
            // Only decrement the number of active threads if we are not a diagnostic
            // thread
            if (!(curCompThreadInfoPT->isDiagnosticThread()))
               compInfo->decNumCompThreadsActive();
            break;

         case COMPTHREAD_SIGNAL_SUSPEND:
            curCompThreadInfoPT->setCompilationThreadState(COMPTHREAD_SIGNAL_TERMINATE);
            break;

         case COMPTHREAD_SIGNAL_TERMINATE:
         case COMPTHREAD_STOPPING:
         case COMPTHREAD_STOPPED:
            // weird case; we should not do anything
            break;

         case COMPTHREAD_UNINITIALIZED:
            // Compilation thread did not have time to become fully initialized
            curCompThreadInfoPT->setCompilationThreadState(COMPTHREAD_SIGNAL_TERMINATE);
            break;

         default:
            TR_ASSERT(false, "No other comp thread state possible here\n");
         }
      }

   TR_ASSERT_FATAL(compInfo->getNumCompThreadsActive() == 0, "All threads must be inactive at this point");

   compInfo->purgeMethodQueue(compilationFailure);

   for (uint8_t i = 0; i < compInfo->getNumTotalCompilationThreads(); i++)
      {
      TR::CompilationInfoPerThread *curCompThreadInfoPT = compInfo->getArrayOfCompilationInfoPerThread()[i];
      if (!curCompThreadInfoPT->isDiagnosticThread() && curCompThreadInfoPT != threadCompInfo)
         {
         while (curCompThreadInfoPT->getCompilationThreadState() != COMPTHREAD_STOPPED)
            {
            compInfo->getCompilationMonitor()->notifyAll();
            compInfo->waitOnCompMonitor(crashedThread);
            }
         }
      }

   compInfo->releaseCompMonitor(crashedThread);

Unfortunately this will not work. The reason because it is not a good idea to stop/suspend compilation threads in the JitDump logic and wait until they are all stopped/suspended before proceeding with the JitDump logic is because we have a deadlock scenario here because the stop/suspension request has to happen at a yield point the compilation threads. These yield points can be things like attempting to acquire VM access, attempting to acquire various monitors, or attempting to reserve trampoline space, etc. Basically anywhere that checks compilationShouldBeInterrupted.

However there is now a timing hole between when the JitDump logic has requested compilation threads to stop/suspend and when they actually reach such a yield point. The deadlock scenario happens if a compilation thread happens to crash in between these two points. It will then go through the dump triggers and it will halt on the dump monitor which prevents multiple crashes from being processed at the same time. Hence such a thread will never stop/suspend because the previous JitDump process holds such a lock and is waiting for other compilation threads to stop/suspend, i.e. a deadlock scenario.

For this reason I do not think it is a good idea to introduce any serialization logic in the JitDump request logic. Ensuring that only the diagnostic thread can process JitDumpMethodDetails requests will ensure everything is handled properly.

@dsouzai
Copy link
Contributor

dsouzai commented Jan 25, 2021

For this reason I do not think it is a good idea to introduce any serialization logic in the JitDump request logic. Ensuring that only the diagnostic thread can process JitDumpMethodDetails requests will ensure everything is handled properly.

Yeah I think this is the best way of going about it; it's probably the simplest solution too.

fjeremic added a commit to fjeremic/openj9 that referenced this issue Feb 1, 2021
…ump compile

Only the diagnostic thread should be processing JitDump compilations.
This is to ensure proper tracing is active and proper synchronization
is performed w.r.t. special events (interpreter shutdown, etc.). The
logic here is split into two parts. In the event we are a diagnostic
thread, the JitDump process will have ensured the method compilation
queue has been purged and all further compilations have been suspended
until the entire JitDump process in complete. This also means if we are
a diagnostic thread, then the only entries in the queue should be
JitDump compile requests. We will ensure this is the case via a fatal
assert.

In addition there can be scenarios in which a non-diagnostic
compilation thread is still attempting to process entries while the
JitDump process (in a crashed thread) is happening. This is because one
compilation thread must always be active, and there is a timing hole
between purging of the queue in the JitDump process, and when the
diagnostic thread is resumed. This means a non-diagnostic thread could
attempt to pick up a JitDump compilation request (see
eclipse-openj9#11772 for details). We must ensure that this does not
happen and simply skip the request if we are a non-diagnostic thread
attempting to process a JitDump compilation.

Signed-off-by: Filip Jeremic <fjeremic@ca.ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants