jdknext AIX: ASSERTION FAILED at CompositeCache.cpp:2437: #9997

andrew-m-leonard · 2020-06-24T07:37:04Z

https://ci.adoptopenjdk.net/view/Failing%20Builds/job/build-scripts/job/jobs/job/jdk/job/jdk-aix-ppc64-openj9/148/console

05:52:43  Creating java.rmi.jmod
05:52:50  Creating java.scripting.jmod
05:53:06  04:53:05.026 0x3000e700   j9shr.1013   *   ** ASSERTION FAILED ** at CompositeCache.cpp:2437: (!(hasReadMutex(currentThread)))
05:53:06  JVMDUMP039I Processing dump event "traceassert", detail "" at 2020/06/24 00:53:05 - please wait.
05:53:06  JVMDUMP032I JVM requested System dump using '/home/jenkins/workspace/build-scripts/jobs/jdk/jdk-aix-ppc64-openj9/workspace/build/src/make/core.20200624.005305.20250638.0001.dmp' in response to an event

DanHeidinga · 2020-06-24T11:58:12Z

fyi @hangshao0 - this looks related to your recent SCC changes.

hangshao0 · 2020-06-24T14:40:50Z

Seems that the core files are not saved/uploaded by the job.

pshipton · 2020-06-24T15:00:17Z

@hangshao0 there is another one from OpenJ9 https://ci.eclipse.org/openj9/job/Build_JDKnext_ppc64_aix_Personal/41/ with a diagnostic download.

pshipton · 2020-06-24T15:03:31Z

Started jdk8 and 11 builds, I'll likely revert the change if these are affected, unless there is a quick fix.

https://ci.eclipse.org/openj9/job/Pipeline-Build-Test-All/950/ - failed due to #9992 (comment)
https://ci.eclipse.org/openj9/job/Pipeline-Build-Test-All-11/214/ - problem recreated

smlambert · 2020-06-24T17:36:29Z

Note: When possible, please download cores from these builds now (or, even better, at the moment of reporting the issue) and attach them to the issue, or link to a location where the core can be found, as the Jenkins links used to report issues have a lifespan of 5 days. Also include the Rerun in Grinder link in the issue, for the convenience of people who start helping with it many days later.

#9835 is another example of an issue report without enough long-lived info, which will cost time and resources to grind and reproduce.

pshipton · 2020-06-24T17:58:46Z

@lumpfish fyi the previous comment.

andrew-m-leonard · 2020-06-24T18:12:04Z

@smlambert agree we need as much as possible, but it takes an awful lot of time every morning triaging, took me 2 hours this morning => adoptium/temurin-build#1634 (comment)
and I didn't even look at Hotspot!! Doing all the above for every issue is going to be half the day gone, every day!
So if we can make life a bit easier, like get enough storage to keep logs/dumps/consoles for say 30 days...?

andrew-m-leonard · 2020-06-24T18:41:59Z

@adamfarley we need some innovation here please?! How about a "button" that "locks" the given test job so it doesn't get deleted? or maybe moves all "artifacts" to some persistent storage elsewhere...?

andrew-m-leonard · 2020-06-24T18:52:00Z

Like maybe just click "Keep this build forever" ?!

hangshao0 · 2020-06-24T19:10:51Z

Found in the trace point:

14:31:48.174972919          0x3000e700      j9shr.1279 Event       SH_OSCachemmap::acquireWriteLock EDEADLK : Case 3: Current thread owns W mon, but EDEADLK'd on W lock
14:31:48.174975550          0x3000e700    omrport.271  Exception * omrfile_lock_bytes: fcntl failed, errno=45
14:31:48.174976397          0x3000e700      j9shr.1501 Event       CM findROMClass: failed to acquire read mutex - returning NULL for class java/io/ByteArrayInputStream with classpath id 0.
14:31:48.175089379          0x3000e700    omrport.271  Exception * omrfile_lock_bytes: fcntl failed, errno=45
14:31:48.175089775          0x3000e700      j9shr.1279 Event       SH_OSCachemmap::acquireWriteLock EDEADLK : Case 3: Current thread owns W mon, but EDEADLK'd on W lock
14:31:48.175091521          0x3000e700    omrport.271  Exception * omrfile_lock_bytes: fcntl failed, errno=45
14:31:48.185117505          0x3000e700      j9shr.1279 Event       SH_OSCachemmap::acquireWriteLock EDEADLK : Case 3: Current thread owns W mon, but EDEADLK'd on W lock
14:31:48.185120743          0x3000e700    omrport.271  Exception * omrfile_lock_bytes: fcntl failed, errno=45
14:31:48.198190325          0x3000e700      j9shr.1279 Event       SH_OSCachemmap::acquireWriteLock EDEADLK : Case 3: Current thread owns W mon, but EDEADLK'd on W lock
14:31:48.198192944          0x3000e700    omrport.271  Exception * omrfile_lock_bytes: fcntl failed, errno=45
14:31:48.208702763          0x3000e700      j9shr.1279 Event       SH_OSCachemmap::acquireWriteLock EDEADLK : Case 3: Current thread owns W mon, but EDEADLK'd on W lock
14:31:48.253201634          0x3000e700      j9shr.2195 Exception * CC changePartialPageProtection: Returning without changing page protection
14:31:48.253201869          0x3000e700      j9shr.2195 Exception * CC changePartialPageProtection: Returning without changing page protection
14:31:48.253202052          0x3000e700      j9shr.2195 Exception * CC changePartialPageProtection: Returning without changing page protection
14:31:48.253202158          0x3000e700      j9shr.2195 Exception * CC changePartialPageProtection: Returning without changing page protection
14:31:48.253298886          0x3000e700      j9shr.1498 Entry      >CM markItemStaleCheckMutex: marking stale cache item at address 0xa00010011129114
14:31:48.253299238          0x3000e700      j9shr.1049 Entry      >CC doLockCache: Locking cache...
14:31:48.253299835          0x3000e700      j9shr.1208 Entry      >CC unprotectMetadataArea: Entering
14:31:48.254806006          0x3000e700      j9shr.1209 Exit       <CC unprotectMetadataArea: Exiting with rc=0
14:31:48.254806286          0x3000e700      j9shr.1051 Exit       <CC doLockCache: Done locking cache
14:31:48.254806540          0x3000e700      j9shr.1500 Exit       <CM markItemStaleCheckMutex: done marking stale cache item at address 0xa00010011129114
14:31:48.254806708          0x3000e700      j9shr.1291 Event       RMI locateROMClass: ROMClass timestamp has changed. Locate request for ROMClass java/io/ByteArrayInputStream from helper ID 0 with cpeIndex 0. Returning NULL.
14:31:48.254807637          0x3000e700      j9shr.2195 Exception * CC changePartialPageProtection: Returning without changing page protection
14:31:48.254828108          0x3000e700      j9shr.1206 Entry      >CC protectMetadataArea: Entering
14:31:48.256258550          0x3000e700      j9shr.1207 Exit       <CC protectMetadataArea: Exiting with rc=0
14:31:48.256306692          0x3000e700      j9shr.1013 Assert    * ** ASSERTION FAILED ** at CompositeCache.cpp:2437: (!(hasReadMutex(currentThread)))

So enterReadMutex failed once:
j9shr.1501 Event CM findROMClass: failed to acquire read mutex - returning NULL.

But J9_PRIVATE_FLAGS2_IN_SHARED_CACHE_READ_MUTEX is still set inside enterReadMutex() even if it returns -1. The next time this thread enters the read mutex, the flag is found to be set and we fail on the assertion. We should check the value of rc before setting J9_PRIVATE_FLAGS2_IN_SHARED_CACHE_READ_MUTEX.
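
A minimal sketch of the flag problem described above, as a standalone model rather than the OpenJ9 code. `FakeThread`, `IN_READ_MUTEX_FLAG`, `tryAcquire` and the `Buggy`/`Fixed` function names are hypothetical stand-ins; only the idea (set the flag only when rc indicates success) comes from the analysis.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for J9_PRIVATE_FLAGS2_IN_SHARED_CACHE_READ_MUTEX.
constexpr std::uint32_t IN_READ_MUTEX_FLAG = 0x1;

// Hypothetical stand-in for the per-thread flags word.
struct FakeThread {
    std::uint32_t privateFlags2 = 0;
};

// Models the low-level lock attempt: 0 on success, -1 on failure.
inline int tryAcquire(bool lockAvailable) {
    return lockAvailable ? 0 : -1;
}

inline bool hasReadMutex(const FakeThread &t) {
    return 0 != (t.privateFlags2 & IN_READ_MUTEX_FLAG);
}

// Buggy shape: the flag is set regardless of rc, so a failed
// attempt leaves the thread marked as "in the read mutex".
inline int enterReadMutexBuggy(FakeThread &t, bool lockAvailable) {
    int rc = tryAcquire(lockAvailable);
    t.privateFlags2 |= IN_READ_MUTEX_FLAG;
    return rc;
}

// Fixed shape: the flag is set only when the lock was acquired.
inline int enterReadMutexFixed(FakeThread &t, bool lockAvailable) {
    int rc = tryAcquire(lockAvailable);
    if (0 == rc) {
        t.privateFlags2 |= IN_READ_MUTEX_FLAG;
    }
    return rc;
}

// Models the precondition that fired in this issue: a thread must not
// already be marked as holding the read mutex when it enters again.
inline int enterWithPrecondition(FakeThread &t, bool lockAvailable) {
    assert(!hasReadMutex(t));
    return enterReadMutexFixed(t, lockAvailable);
}
```

With the buggy shape, one failed attempt is enough to make the next entry trip the precondition; with the fixed shape, a failed attempt leaves the flag clear and a later attempt can proceed.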

hangshao0 · 2020-06-24T22:17:32Z

The comment in the code (https://github.com/eclipse/openj9/blob/master/runtime/shared_common/OSCachemmap.cpp#L748to#L764) suggests that another thread is holding the RW mutex when we see the message "SH_OSCachemmap::acquireWriteLock EDEADLK : Case 3: Current thread owns W mon, but EDEADLK'd on W lock". However, I don't think that comment explains what is happening here.

If another thread in this JVM were holding the RW mutex, we should see the Case 3 message only once, because we retry while holding _lockMutex[J9SH_OSCACHE_MMAP_LOCKID_READWRITELOCK], which guarantees that no other thread in this JVM has the RW mutex during the second attempt. However, we are seeing this message multiple times here.
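
The reasoning above can be modelled with a small sketch. This is not the OSCachemmap.cpp code: `acquireWithInProcessRetry`, `LockAttemptResult` and the injected `tryFileLock` callable are hypothetical, and the in-process mutex is a plain `std::mutex` standing in for the RW lock mutex. It only shows why a second EDEADLK points away from a sibling thread in the same JVM.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <mutex>

// Hypothetical result type for this model.
struct LockAttemptResult {
    int rc;             // 0 on success, -1 on failure
    int deadlockCount;  // how many attempts reported EDEADLK
};

// tryFileLock models the fcntl-based call: it returns 0 on success,
// or an errno value such as EDEADLK on failure.
inline LockAttemptResult acquireWithInProcessRetry(
        const std::function<int()> &tryFileLock, std::mutex &rwMutex) {
    LockAttemptResult result{-1, 0};

    // First attempt, without the in-process RW mutex.
    int err = tryFileLock();
    if (0 == err) {
        result.rc = 0;
        return result;
    }
    if (EDEADLK != err) {
        return result;
    }
    result.deadlockCount++;

    // Second attempt: hold the in-process RW mutex, so no sibling
    // thread in this process can own the RW lock while we try again.
    std::lock_guard<std::mutex> guard(rwMutex);
    err = tryFileLock();
    if (0 == err) {
        result.rc = 0;
    } else if (EDEADLK == err) {
        result.deadlockCount++;
    }
    return result;
}
```

In this model, an in-process cause yields a `deadlockCount` of 1 followed by success. A count of 2 means the conflict survived the in-process exclusion, which matches the repeated Case 3 messages in the trace.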

Also, I made the JVM crash when it fails to acquire the write mutex. The Case 3 message is still there (multiple times), but no thread in this JVM owns the RW mutex:

!j9shrcompositecachecommoninfo 0x0000010020ED0570
J9ShrCompositeCacheCommonInfo at 0x10020ed0570 {
  Fields for J9ShrCompositeCacheCommonInfo:
        0x0: U64 writeMutexEntryCount = 0x0000000000000000 (0)
        0x8: struct J9VMThread* hasWriteMutexThread = !j9vmthread 0x0000000000000000
        0x10: struct J9VMThread* hasReadWriteMutexThread = !j9vmthread 0x0000000000000000
        0x18: struct J9VMThread* hasRefreshMutexThread = !j9vmthread 0x0000000000000000
        0x20: struct J9VMThread* hasRWMutexThreadMprotectAll = !j9vmthread 0x0000000000000000
       ...
}

Something else is going on.

hangshao0 · 2020-06-25T21:18:37Z

There is an Attach API thread locking some file under /tmp/.com_ibm_tools_attach/, which caused j9file_lock_bytes() in SH_OSCachemmap::acquireWriteLock() to fail with error EDEADLK.

Here is what is happening, which gives fcntl() the impression of a deadlock:

JVM1:
Thread 1 - Holds the write lock of the SCC.
Thread 2 - Waiting for the write lock of file /tmp/.com_ibm_tools_attach/_attachlock

JVM2:
Thread 3 - Holds the write lock of file /tmp/.com_ibm_tools_attach/_attachlock
Thread 4 - Calling j9file_lock_bytes() to wait for the write mutex of the SCC in SH_OSCachemmap::acquireWriteLock() -> failed with EDEADLK.

This is existing behaviour, not something introduced recently. Not sure if we can do better in this case. I guess our current behaviour, which lets SH_OSCachemmap::acquireWriteLock() return -1 after several failed attempts, is fine.
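
A minimal sketch of the bounded-retry behaviour described above, assuming a POSIX system. POSIX record locks belong to the process rather than the thread, so as far as fcntl() can tell, JVM1 holds the SCC lock while waiting on _attachlock and JVM2 holds _attachlock while waiting on the SCC lock, which looks like a cycle. The helper name `acquireWriteLockWithRetry` and its `maxAttempts` and `delayMs` parameters are hypothetical; this is not the code in SH_OSCachemmap::acquireWriteLock().

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

// Hypothetical helper: try to take a whole-file write lock, retrying a
// bounded number of times when fcntl() reports EDEADLK (or EINTR).
// Returns 0 once the lock is held, -1 if every attempt failed.
inline int acquireWriteLockWithRetry(int fd, int maxAttempts, int delayMs) {
    struct flock lock {};
    lock.l_type = F_WRLCK;
    lock.l_whence = SEEK_SET;
    lock.l_start = 0;
    lock.l_len = 0;  // 0 means "through end of file"

    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        if (0 == fcntl(fd, F_SETLKW, &lock)) {
            return 0;
        }
        if (EDEADLK != errno && EINTR != errno) {
            return -1;  // not a transient condition, give up at once
        }
        // Back off briefly so the other process can release its lock.
        struct timespec ts { delayMs / 1000, (delayMs % 1000) * 1000000L };
        nanosleep(&ts, nullptr);
    }
    return -1;  // caller must handle failure, e.g. skip the cache access
}
```

The caller still has to treat -1 as a real failure, which is where the flag handling in enterReadMutex() matters.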

smlambert · 2020-06-26T12:29:55Z

Related to #9997 (comment): I understand the time pressure and appreciate the concern, @andrew-m-leonard. Perhaps it is better to triage fewer things but more thoroughly, so the issues can be closed (and not keep showing up to be triaged). As it is now, if we do not include enough info, we are merely shifting the effort to the next person who picks up the issue (which happens later, so it can become an impossible task for the next person in the chain).

We're adding functionality to the "Create new issue" button in TRSS that will automatically include all links (and git diffs, java -version info, rerun links, artifactory links if present, and first occurrence; see adoptium/aqa-test-tools#258).

It would be useful to have someone record a typical nightly build & test triage session to understand how it is currently done, so we can improve the process for those who do it.

andrew-m-leonard mentioned this issue Jun 24, 2020

Nightly Build&Test Triage Report adoptium/temurin-build#1634

Closed

DanHeidinga added the comp:vm label Jun 24, 2020

DanHeidinga added this to the Release 0.22 (Java 15) milestone Jun 24, 2020

This was referenced Jun 24, 2020

Restore ability to build with OpenJ9 ibmruntimes/openj9-openjdk-jdk#216

Merged

Add a private flag to indicate a thread is in SCC read mutex #9988

Merged

Revert "Add a private flag to indicate a thread is in SCC read mutex" #10002

Merged

keithc-ca mentioned this issue Jun 24, 2020

Create javadoc for only openj9 extensions ibmruntimes/openj9-openjdk-jdk11#320

Merged

hangshao0 added a commit to hangshao0/openj9 that referenced this issue Jun 29, 2020

Check the return value before setting the flag for SCC read mutex

e20f374

Fixes eclipse-openj9#9997 Signed-off-by: Hang Shao <hangshao@ca.ibm.com>

hangshao0 mentioned this issue Jun 29, 2020

Add a private flag to indicate a thread is in SCC read mutex #10043

Merged

pshipton closed this as completed in #10043 Jun 29, 2020
