
JVM Crash in heavily threaded environment #8977

Closed
treo opened this issue May 27, 2020 · 3 comments
Labels
Bug Bugs and problems LIBND4J

Comments

@treo
Member

treo commented May 27, 2020

Issue Description

While training an RL4J model, a user has reported JVM Crashes (https://community.konduit.ai/t/fatal-error-terminate-called-after-throwing-an-instance-of-std-runtime-error-what-bad-data-type/447/14)

I used the shared code to reproduce the issue, but I had to crank the thread count up to 128 to make the crash appear early in the run.

The user's error logs indicate that ConcMarkSweepGC was in use. With that collector enabled, I saw crashes like these:

# first
[ERROR] Unknown dtypeX=0 on D:/jenkins/ws/dl4j-deeplearning4j-1.0.0-beta7-windows-x86_64-cpu-avx2/libnd4j/include/legacy/cpu/NativeOpExecutioner.cpp:1310
terminate called recursively
terminate called recursively
terminate called after throwing an instance of 'std::runtime_error'
  what():  bad data type
terminate called recursively
terminate called recursively
terminate called recursively

#second
[ERROR] Unknown dtypeX=1400475768 on D:/jenkins/ws/dl4j-deeplearning4j-1.0.0-beta7-windows-x86_64-cpu-avx2/libnd4j/include/legacy/cpu/NativeOpExecutioner.cpp:1310
terminate called after throwing an instance of 'std::runtime_error'
  what():  bad data type

#third
[ERROR] Unknown dtypeX=1283687984 on D:/jenkins/ws/dl4j-deeplearning4j-1.0.0-beta7-windows-x86_64-cpu-avx2/libnd4j/include/legacy/cpu/NativeOpExecutioner.cpp:1310
terminate called after throwing an instance of 'std::runtime_error'
  what():  bad data type

And a few full-on JVM crashes like this:

#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000000003d9aeb90, pid=3408, tid=0x0000000000001f90
#
# JRE version: Java(TM) SE Runtime Environment (8.0_162-b12) (build 1.8.0_162-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.162-b12 mixed mode windows-amd64 compressed oops)
# Problematic frame:
# C  [libnd4jcpu.dll+0xe3eb90]

The operation it crashed on varied from run to run, but every crash shared a common root:

J 4277 C2 org.deeplearning4j.rl4j.learning.async.AsyncThreadDiscrete.trainSubEpoch(Lorg/deeplearning4j/rl4j/observation/Observation;I)Lorg/deeplearning4j/rl4j/learning/async/AsyncThread$SubEpochReturn; (315 bytes) @ 0x0000000003eeea20 [0x0000000003eee000+0xa20]
J 4563 C2 org.deeplearning4j.rl4j.learning.async.AsyncThread.handleTraining(Lorg/deeplearning4j/rl4j/learning/async/AsyncThread$RunContext;)Z (79 bytes) @ 0x00000000040c42c8 [0x00000000040c4260+0x68]
j  org.deeplearning4j.rl4j.learning.async.AsyncThread.run()V+97

I've also tried using SerialGC, with the result that the JVM still crashed, but with no useful error message.

As far as I can tell, the parallel training in this instance uses a separate model per thread and synchronized access to the shared model:
https://github.com/eclipse/deeplearning4j/blob/master/rl4j/rl4j-core/src/main/java/org/deeplearning4j/rl4j/learning/async/AsyncThreadDiscrete.java
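For reference, the access pattern described above boils down to something like the following plain-Java sketch (class and method names are hypothetical, not the actual RL4J types): each worker trains on a private copy of the parameters and only touches the shared state through synchronized methods.

```java
import java.util.Arrays;

// Hypothetical stand-in for the shared model: all access to the shared
// parameter array goes through synchronized methods.
class SharedModel {
    private final double[] params;

    SharedModel(int size) {
        this.params = new double[size];
    }

    synchronized double[] copyParams() {
        return Arrays.copyOf(params, params.length);
    }

    synchronized void applyGradient(double[] gradient) {
        for (int i = 0; i < params.length; i++) {
            params[i] += gradient[i];
        }
    }
}

// Hypothetical worker: trains on a private copy, then pushes an update
// back under the shared model's lock.
class Worker implements Runnable {
    private final SharedModel global;

    Worker(SharedModel global) {
        this.global = global;
    }

    @Override
    public void run() {
        double[] local = global.copyParams();
        double[] gradient = new double[local.length];
        Arrays.fill(gradient, 0.01); // dummy "gradient" for the sketch
        global.applyGradient(gradient);
    }
}
```

If this pattern is followed consistently, the shared parameters themselves should not be a source of corruption, which is what makes the native-side dtype garbage above so suspicious.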

Version Information

Relevant versions:

  • Deeplearning4j version: 1.0.0-beta7
  • Platform information (OS, etc): Windows 10
  • CUDA version, if used: N/A
  • NVIDIA driver version, if in use: N/A
@treo treo added the Bug Bugs and problems label May 27, 2020
@hurui200320

Same issue here. The first time I noticed it, the JVM just returned 0xc0000005 with no further information; the program simply ended with that exit code. After fiddling a bit, the JVM detected the crash and generated an hs_err*.log.

The code is below, with the crashing section marked. Getting the output of the network causes the crash. There appear to be random crashes during testRawInput.forEachLine: sometimes it crashes after the first iteration with exit code 0xc0000005, sometimes it runs a few iterations (4 or 5) and then the JVM generates the hs_err log. The basic idea of this code is to read the SIGHAN 2015 test set and turn each line into a [1, 300, 128] NDArray, where 300 is the word-vector dimension and 128 is the padded time-step count, then take the network's output and convert it into the official format.

The issue only shows up when using the CPU backend; the GPU backend is totally fine.

package info.skyblond.fiona.zhwiki

import com.hankcs.hanlp.HanLP
import info.skyblond.fiona.dl4j.Word2VecHelper
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
import org.nd4j.linalg.factory.Nd4j
import java.io.File
import java.nio.charset.StandardCharsets

fun main() {
    val testRawInput = File("data/SIGHAN 2015 CSC Datasets/Test/SIGHAN15_CSC_TestInput.txt")

    val model = MultiLayerNetwork.load(
        File("checkpoint_38_MultiLayerNetwork.zip"),
        true
    )

    testRawInput.forEachLine(StandardCharsets.UTF_8) { rawLine ->
        val cleanString = rawLine // placeholder: the original code processes the line before feeding it to the network

        // input data shape: [batchSize, features, timesteps]
        // i.e.: [1, 300, 128]
        val input = Nd4j.zeros(1, 300, 128)
        cleanString.take(128).forEachIndexed { timeSteps, s ->
            // result is a list of doubles of size 300
            val result = if (Word2VecHelper.containsKey(s)) Word2VecHelper[s] else Word2VecHelper["<unk>"]
            result.forEachIndexed { featureIndex, d ->
                input.putScalar(intArrayOf(1, featureIndex, timeSteps), d)
            }
        }

        //---------------------- CRASH IN BELOW CODE ---------------------------

        // Get the output
        // output: [batchSize, nOut, timesteps]
        // i.e.: [1, 3, 128]; after argMax(1) the shape should be [1, 128]
        val output = model.output(input).argMax(1) // CRASH HERE
        val tagRawResult = mutableListOf<String>()
        for (i in cleanString.indices) {
            val tag = when (output.getInt(1, i)) {
                0 -> "O"
                1 -> "B"
                2 -> "I"
                else -> "<unk>"
            }
            tagRawResult.add(tag)
        }

        //---------------------- CRASH IN ABOVE CODE --------------------------

        // process into the SIGHAN 2015 official tool's format
        // ......
    }
}
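One detail worth double-checking in the snippet above, which may or may not be related to the crash: for an input of shape [1, 300, 128] the only valid batch index is 0, yet the putScalar call (and later output.getInt(1, i)) passes 1. If the native write path does not bounds-check, an index like that can silently write past the buffer. A minimal plain-Java sketch of the bounds check such an index would fail (helper name hypothetical, no ND4J dependency):

```java
// Hypothetical helper: maps an n-dimensional index into a flat offset for a
// C-ordered array, rejecting any index that falls outside the shape.
class IndexCheck {
    static long offset(long[] shape, long[] idx) {
        if (shape.length != idx.length) {
            throw new IllegalArgumentException("rank mismatch");
        }
        long off = 0;
        for (int d = 0; d < shape.length; d++) {
            if (idx[d] < 0 || idx[d] >= shape[d]) {
                throw new IndexOutOfBoundsException(
                        "index " + idx[d] + " out of bounds for dimension " + d
                                + " (size " + shape[d] + ")");
            }
            // C-order: the last dimension varies fastest.
            off = off * shape[d] + idx[d];
        }
        return off;
    }
}
```

With shape {1, 300, 128}, an index of {1, 0, 0} fails this check, which is exactly the write the feature loop above performs once per feature.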

Network conf:

val conf = NeuralNetConfiguration.Builder()
            // training setting...
            .list()
            .layer(
                0, Bidirectional(
                    Bidirectional.Mode.ADD,
                    LSTM.Builder()
                        .activation(Activation.TANH)
                        .nIn(300)
                        .nOut(hiddenLayerSize)
                        .build()
                )
            )
            .layer(
                1,
                RecurrentAttentionLayer.Builder()
                    .nIn(hiddenLayerSize)
                    .nOut(feedforwardLayerSize1)
                    .projectInput(true)
                    .nHeads(attentionHeadNumber)
                    .build(),
            )
            .layer(
                2, DenseLayer.Builder()
                    .nIn(feedforwardLayerSize1)
                    .nOut(feedforwardLayerSize2)
                    .activation(Activation.RELU)
                    .build()
            )
            .layer(
                3,
                RnnOutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                    .dropOut(0.0) // turn off dropout
                    .activation(Activation.SOFTMAX)
                    .nIn(feedforwardLayerSize2)
                    .nOut(numLabelClasses)
                    .build(),
            )
            .inputPreProcessor(2, RnnToFeedForwardPreProcessor())
            .inputPreProcessor(3, FeedForwardToRnnPreProcessor())
            .build()

Excerpts from the hs_err log:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000000000365eb90, pid=18192, tid=17144
#
# JRE version: OpenJDK Runtime Environment (11.0.7+10) (build 11.0.7+10)
# Java VM: OpenJDK 64-Bit Server VM (11.0.7+10, mixed mode, tiered, compressed oops, g1 gc, windows-amd64)
# Problematic frame:
# C  [libnd4jcpu.dll+0xe3eb90]
#
# No core dump will be written. Minidumps are not enabled by default on client versions of Windows
#
# If you would like to submit a bug report, please visit:
#   https://github.com/AdoptOpenJDK/openjdk-support/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  S U M M A R Y ------------

Command Line: -Xmx12G -javaagent:C:\Program Files\JetBrains\IntelliJ IDEA 2019.2.4\lib\idea_rt.jar=62999:C:\Program Files\JetBrains\IntelliJ IDEA 2019.2.4\bin -Dfile.encoding=UTF-8 info.skyblond.fiona.zhwiki.TestOnSIGHAN2015Kt

Host: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz, 8 cores, 31G,  Windows 10 , 64 bit Build 19041 (10.0.19041.546)
Time: Thu Oct 29 22:27:13 2020 elapsed time: 44 seconds (0d 0h 0m 44s)

---------------  T H R E A D  ---------------

Current thread (0x0000026417007000):  JavaThread "main" [_thread_in_native, id=17144, stack(0x0000008600900000,0x0000008600a00000)]

Stack: [0x0000008600900000,0x0000008600a00000],  sp=0x00000086009fdd88,  free space=1015k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libnd4jcpu.dll+0xe3eb90]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 3209  org.nd4j.nativeblas.Nd4jCpu.execTransformStrict(Lorg/bytedeco/javacpp/PointerPointer;ILorg/nd4j/nativeblas/OpaqueDataBuffer;Lorg/bytedeco/javacpp/LongPointer;Lorg/bytedeco/javacpp/LongPointer;Lorg/nd4j/nativeblas/OpaqueDataBuffer;Lorg/bytedeco/javacpp/LongPointer;Lorg/bytedeco/javacpp/LongPointer;Lorg/bytedeco/javacpp/Pointer;)V (0 bytes) @ 0x0000026428b6e04e [0x0000026428b6df40+0x000000000000010e]
J 3843 c2 org.nd4j.linalg.cpu.nativecpu.ops.NativeOpExecutioner.exec(Lorg/nd4j/linalg/api/ops/TransformOp;Lorg/nd4j/linalg/api/ops/OpContext;)V (1440 bytes) @ 0x0000026428c21154 [0x0000026428c1ff00+0x0000000000001254]
J 3811 c2 org.nd4j.linalg.cpu.nativecpu.ops.NativeOpExecutioner.exec(Lorg/nd4j/linalg/api/ops/Op;Lorg/nd4j/linalg/api/ops/OpContext;)Lorg/nd4j/linalg/api/ndarray/INDArray; (143 bytes) @ 0x0000026428be9744 [0x0000026428be95e0+0x0000000000000164]
J 3971 c2 org.nd4j.autodiff.samediff.internal.InferenceSession.doExec(Lorg/nd4j/autodiff/functions/DifferentialFunction;Lorg/nd4j/linalg/api/ops/OpContext;Lorg/nd4j/autodiff/samediff/internal/FrameIter;Ljava/util/Set;Ljava/util/Set;Ljava/util/Set;)[Lorg/nd4j/linalg/api/ndarray/INDArray; (1718 bytes) @ 0x0000026428cb2224 [0x0000026428cb1ec0+0x0000000000000364]
J 3920 c2 org.nd4j.autodiff.samediff.internal.InferenceSession.getOutputs(Lorg/nd4j/common/primitives/Pair;Lorg/nd4j/autodiff/samediff/internal/FrameIter;Ljava/util/Set;Ljava/util/Set;Ljava/util/Set;Ljava/util/List;Lorg/nd4j/autodiff/listeners/At;Lorg/nd4j/linalg/dataset/api/MultiDataSet;Ljava/util/Set;)[Lorg/nd4j/linalg/api/ndarray/INDArray; (1264 bytes) @ 0x0000026428c86888 [0x0000026428c86700+0x0000000000000188]
J 3972 c2 org.nd4j.autodiff.samediff.internal.InferenceSession.getOutputs(Ljava/lang/Object;Lorg/nd4j/autodiff/samediff/internal/FrameIter;Ljava/util/Set;Ljava/util/Set;Ljava/util/Set;Ljava/util/List;Lorg/nd4j/autodiff/listeners/At;Lorg/nd4j/linalg/dataset/api/MultiDataSet;Ljava/util/Set;)[Ljava/lang/Object; (23 bytes) @ 0x0000026428cb3c2c [0x0000026428cb3bc0+0x000000000000006c]
J 3870 c1 org.nd4j.autodiff.samediff.internal.AbstractSession.output(Ljava/util/List;Ljava/util/Map;Lorg/nd4j/linalg/dataset/api/MultiDataSet;Ljava/util/Collection;Ljava/util/List;Lorg/nd4j/autodiff/listeners/At;)Ljava/util/Map; (2553 bytes) @ 0x0000026421e2e01c [0x0000026421e240c0+0x0000000000009f5c]
j  org.nd4j.autodiff.samediff.SameDiff.directExecHelper(Ljava/util/Map;Lorg/nd4j/autodiff/listeners/At;Lorg/nd4j/linalg/dataset/api/MultiDataSet;Ljava/util/Collection;Ljava/util/List;[Ljava/lang/String;)Ljava/util/Map;+175
j  org.nd4j.autodiff.samediff.SameDiff.batchOutputHelper(Ljava/util/Map;Ljava/util/List;Lorg/nd4j/autodiff/listeners/Operation;[Ljava/lang/String;)Ljava/util/Map;+196
j  org.nd4j.autodiff.samediff.SameDiff.output(Ljava/util/Map;Ljava/util/List;[Ljava/lang/String;)Ljava/util/Map;+7
j  org.nd4j.autodiff.samediff.config.BatchOutputConfig.output()Ljava/util/Map;+28
j  org.nd4j.autodiff.samediff.SameDiff.output(Ljava/util/Map;[Ljava/lang/String;)Ljava/util/Map;+12
j  org.deeplearning4j.nn.layers.samediff.SameDiffLayer.activate(ZLorg/deeplearning4j/nn/workspace/LayerWorkspaceMgr;)Lorg/nd4j/linalg/api/ndarray/INDArray;+378
j  org.deeplearning4j.nn.layers.AbstractLayer.activate(Lorg/nd4j/linalg/api/ndarray/INDArray;ZLorg/deeplearning4j/nn/workspace/LayerWorkspaceMgr;)Lorg/nd4j/linalg/api/ndarray/INDArray;+9
j  org.deeplearning4j.nn.multilayer.MultiLayerNetwork.outputOfLayerDetached(ZLorg/deeplearning4j/nn/api/FwdPassType;ILorg/nd4j/linalg/api/ndarray/INDArray;Lorg/nd4j/linalg/api/ndarray/INDArray;Lorg/nd4j/linalg/api/ndarray/INDArray;Lorg/nd4j/linalg/api/memory/MemoryWorkspace;)Lorg/nd4j/linalg/api/ndarray/INDArray;+665
j  org.deeplearning4j.nn.multilayer.MultiLayerNetwork.output(Lorg/nd4j/linalg/api/ndarray/INDArray;ZLorg/nd4j/linalg/api/ndarray/INDArray;Lorg/nd4j/linalg/api/ndarray/INDArray;Lorg/nd4j/linalg/api/memory/MemoryWorkspace;)Lorg/nd4j/linalg/api/ndarray/INDArray;+18
j  org.deeplearning4j.nn.multilayer.MultiLayerNetwork.output(Lorg/nd4j/linalg/api/ndarray/INDArray;ZLorg/nd4j/linalg/api/ndarray/INDArray;Lorg/nd4j/linalg/api/ndarray/INDArray;)Lorg/nd4j/linalg/api/ndarray/INDArray;+7
j  org.deeplearning4j.nn.multilayer.MultiLayerNetwork.output(Lorg/nd4j/linalg/api/ndarray/INDArray;Z)Lorg/nd4j/linalg/api/ndarray/INDArray;+5
j  org.deeplearning4j.nn.multilayer.MultiLayerNetwork.output(Lorg/nd4j/linalg/api/ndarray/INDArray;Lorg/deeplearning4j/nn/api/Layer$TrainingMode;)Lorg/nd4j/linalg/api/ndarray/INDArray;+14
j  org.deeplearning4j.nn.multilayer.MultiLayerNetwork.output(Lorg/nd4j/linalg/api/ndarray/INDArray;)Lorg/nd4j/linalg/api/ndarray/INDArray;+5
j  info.skyblond.fiona.zhwiki.TestOnSIGHAN2015Kt$main$1.invoke(Ljava/lang/String;)V+515
j  info.skyblond.fiona.zhwiki.TestOnSIGHAN2015Kt$main$1.invoke(Ljava/lang/Object;)Ljava/lang/Object;+5
j  kotlin.io.TextStreamsKt.forEachLine(Ljava/io/Reader;Lkotlin/jvm/functions/Function1;)V+134
j  kotlin.io.FilesKt__FileReadWriteKt.forEachLine(Ljava/io/File;Ljava/nio/charset/Charset;Lkotlin/jvm/functions/Function1;)V+53
j  info.skyblond.fiona.zhwiki.TestOnSIGHAN2015Kt.main()V+170
j  info.skyblond.fiona.zhwiki.TestOnSIGHAN2015Kt.main([Ljava/lang/String;)V+0
v  ~StubRoutines::call_stub

siginfo: EXCEPTION_ACCESS_VIOLATION (0xc0000005), writing address 0x0000000000000050

Full version of hs_err: https://del.dog/apunimefir

@cyberbeat
Contributor

I also ran into something similar:

[pool-1-thread-3] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [CpuBackend] backend
[pool-1-thread-3] INFO org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for linear algebra: 2
[pool-1-thread-3] INFO org.nd4j.nativeblas.Nd4jBlas - Number of threads used for OpenMP BLAS: 2
[pool-1-thread-3] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Backend used: [CPU]; OS: [Linux]
[pool-1-thread-3] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Cores: [2]; Memory: [2,8GB];
[pool-1-thread-3] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Blas vendor: [OPENBLAS]
[pool-1-thread-3] INFO org.deeplearning4j.nn.graph.ComputationGraph - Starting ComputationGraph with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
[pool-1-thread-3] INFO org.deeplearning4j.optimize.listeners.PerformanceListener - ETL: 40 ms; iteration 0; iteration time: 10583 ms; samples/sec: 96,759; batches/sec: 0,094;
[pool-1-thread-3] INFO org.deeplearning4j.optimize.listeners.TimeIterationListener - Remaining time : 141mn - End expected : Sun Jan 03 11:40:28 CET 2021
..
..
[pool-1-thread-3] INFO org.deeplearning4j.optimize.listeners.PerformanceListener - ETL: 0 ms; iteration 799; iteration time: 988 ms; samples/sec: 15,182; batches/sec: 1,012; GC: [G1 Young Generation: 0 (0ms)], [G1 Old Generation: 0 (0ms)];
[pool-1-thread-3] INFO org.deeplearning4j.optimize.listeners.TimeIterationListener - Remaining time : 0mn - End expected : Sun Jan 03 11:09:10 CET 2021
[ERROR] Unknown dtypeX=32603 on /home/jenkins/agent/workspace/ing4j_deeplearning4j-1.0.0-beta7/libnd4j/include/legacy/cpu/NativeOpExecutioner.cpp:1355terminate called after throwing an instance of 'std::runtime_error'
what(): bad data type

@agibsonccc
Contributor

Closing this. MultiLayerNetworks and NDArrays are generally not safe objects to reference or pass around between threads without proper protection. This is why we wrote ParallelWrapper and ParallelInference.
Also make sure to set OMP_NUM_THREADS=1 in a multi-threaded environment to make execution more predictable. If you have a particular use case for multi-threading, we'd be happy to hear it. Otherwise, I'm closing this for now.
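When ParallelWrapper/ParallelInference are not an option, one stopgap that sidesteps the thread-safety problem entirely is to confine the model to a single dedicated thread and funnel all inference requests through it. A plain-Java sketch of the idea (the Function stands in for model.output(...); this is not a DL4J API):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Confines all calls to a non-thread-safe "model" to one dedicated thread.
// Callers from any thread block until the worker thread has run the model.
class SingleThreadInference<I, O> implements AutoCloseable {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Function<I, O> model;

    SingleThreadInference(Function<I, O> model) {
        this.model = model;
    }

    O output(I input) {
        try {
            return worker.submit(() -> model.apply(input)).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void close() {
        worker.shutdown();
    }
}
```

This serializes inference (so it gives up parallel throughput), but it guarantees the model only ever sees one thread, which is the same invariant OMP_NUM_THREADS=1 plus single-threaded use is after.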
