JVM Crash in heavily threaded environment #8977
Comments
Same issue here. This is the first time I've seen the JVM crash like this. Code is below; I marked out the crash section. Getting the output of the network causes the crash, and the crashes look random during inference. The issue only shows up when using the CPU backend; using the GPU backend is totally OK.
package info.skyblond.fiona.zhwiki
import com.hankcs.hanlp.HanLP
import info.skyblond.fiona.dl4j.Word2VecHelper
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
import org.nd4j.linalg.factory.Nd4j
import java.io.File
import java.nio.charset.StandardCharsets
fun main() {
val testRawInput = File("data/SIGHAN 2015 CSC Datasets/Test/SIGHAN15_CSC_TestInput.txt")
val model = MultiLayerNetwork.load(
File("checkpoint_38_MultiLayerNetwork.zip"),
true
)
testRawInput.forEachLine(StandardCharsets.UTF_8) { rawLine ->
    // processed line content to be fed into the network;
    // the actual cleaning step is elided in this snippet, so rawLine stands in for it
    val cleanString = rawLine
    // input data shape: [batchSize, features, timesteps]
    // i.e.: [1, 300, 128]
    val input = Nd4j.zeros(1, 300, 128)
    cleanString.take(128).forEachIndexed { timeStep, s ->
        // result is a list of doubles, size 300
        val result = if (Word2VecHelper.containsKey(s)) Word2VecHelper[s] else Word2VecHelper["<unk>"]
        result.forEachIndexed { featureIndex, d ->
            // batch index must be 0 (there is only one sample); the original had 1 here
            input.putScalar(intArrayOf(0, featureIndex, timeStep), d)
        }
    }
//---------------------- CRASH IN BELOW CODE ---------------------------
// Get the output
// output: [batchSize, nOut, timesteps]
// i.e.: [1, 3, 128], after argmax(1), shape should be [1, 128]
val output = model.output(input).argMax(1) // CRASH HERE
val tagRawResult = mutableListOf<String>()
    // only the first 128 characters were encoded as timesteps above
    for (i in cleanString.take(128).indices) {
        val tag = when (output.getInt(0, i)) { // batch index 0, not 1
0 -> "O"
1 -> "B"
2 -> "I"
else -> "<unk>"
}
tagRawResult.add(tag)
}
//---------------------- CRASH IN ABOVE CODE --------------------------
// process into the SIGHAN 2015 official tool's format
// ......
    }
}

Network conf:

val conf = NeuralNetConfiguration.Builder()
// training setting...
.list()
.layer(
0, Bidirectional(
Bidirectional.Mode.ADD,
LSTM.Builder()
.activation(Activation.TANH)
.nIn(300)
.nOut(hiddenLayerSize)
.build()
)
)
.layer(
1,
RecurrentAttentionLayer.Builder()
.nIn(hiddenLayerSize)
.nOut(feedforwardLayerSize1)
.projectInput(true)
.nHeads(attentionHeadNumber)
.build(),
)
.layer(
2, DenseLayer.Builder()
.nIn(feedforwardLayerSize1)
.nOut(feedforwardLayerSize2)
.activation(Activation.RELU)
.build()
)
.layer(
3,
RnnOutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
.dropOut(0.0) // turn off dropout
.activation(Activation.SOFTMAX)
.nIn(feedforwardLayerSize2)
.nOut(numLabelClasses)
.build(),
)
.inputPreProcessor(2, RnnToFeedForwardPreProcessor())
.inputPreProcessor(3, FeedForwardToRnnPreProcessor())
    .build()

Parts of the hs_err:
Full version of hs_err: https://del.dog/apunimefir
I also ran into something similar:

[pool-1-thread-3] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [CpuBackend] backend
Closing this: MultiLayerNetworks and NDArrays are typically not safe objects to reference or pass around across threads without proper protection. This is why we wrote ParallelWrapper and ParallelInference.
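For readers landing here with the same problem, here is a minimal sketch of that approach, assuming the DL4J parallelism module (ParallelInference) is on the classpath; the builder options below are illustrative assumptions, not a recommended configuration:

import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
import org.deeplearning4j.parallelism.ParallelInference
import org.deeplearning4j.parallelism.inference.InferenceMode
import org.nd4j.linalg.factory.Nd4j
import java.io.File

fun main() {
    val model = MultiLayerNetwork.load(File("checkpoint_38_MultiLayerNetwork.zip"), true)
    // ParallelInference keeps its own worker copies of the model, so callers
    // never touch a MultiLayerNetwork instance from multiple threads directly
    val pi = ParallelInference.Builder(model)
        .inferenceMode(InferenceMode.BATCHED) // merge concurrent requests into batches
        .batchLimit(32)                       // upper bound on a merged batch (assumption)
        .workers(2)                           // number of internal model copies (assumption)
        .build()
    // safe to call from many threads concurrently
    val output = pi.output(Nd4j.zeros(1, 300, 128))
    println(output.shapeInfoToString())
}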
Issue Description
While training an RL4J model, a user reported JVM crashes (https://community.konduit.ai/t/fatal-error-terminate-called-after-throwing-an-instance-of-std-runtime-error-what-bad-data-type/447/14).
I used the shared code to reproduce the issue, but I had to crank the thread count up to 128 to force the crash to appear early in the run.
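(The shared code is not reproduced here; the sketch below is a hypothetical harness in the same spirit, with the model file name borrowed from the snippet earlier in this thread, showing the kind of unsynchronized shared access that provokes such crashes.)

import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
import org.nd4j.linalg.factory.Nd4j
import java.io.File
import java.util.concurrent.Executors

fun main() {
    // one shared, unprotected model instance
    val model = MultiLayerNetwork.load(File("checkpoint_38_MultiLayerNetwork.zip"), true)
    val pool = Executors.newFixedThreadPool(128)
    repeat(128) {
        pool.execute {
            // every thread calls output() on the same object concurrently;
            // the loop runs until the process crashes or is killed
            while (true) {
                model.output(Nd4j.zeros(1, 300, 128))
            }
        }
    }
}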
The user's error logs indicate that ConcMarkSweepGC was used. With that option I saw crashes of the same kind, and a few full-on JVM crashes as well. The crash always happened in a different operation, but all of them shared a common root.
I also tried SerialGC, with the result that the JVM still crashed, but without any useful error message.
As far as I can tell, the parallel training in this instance uses separate models per thread and synchronized access to the shared model:
https://github.com/eclipse/deeplearning4j/blob/master/rl4j/rl4j-core/src/main/java/org/deeplearning4j/rl4j/learning/async/AsyncThreadDiscrete.java
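For context, that pattern looks roughly like the following simplified sketch (not the actual AsyncThreadDiscrete code; the class and method names here are invented for illustration):

import org.deeplearning4j.nn.multilayer.MultiLayerNetwork

// each worker holds its own model copy and only touches the shared
// global network while holding its lock
class WorkerThread(private val globalNetwork: MultiLayerNetwork) {
    private val localNetwork: MultiLayerNetwork =
        synchronized(globalNetwork) { globalNetwork.clone() }

    fun syncWithGlobal() {
        synchronized(globalNetwork) {
            // dup() so the local copy does not alias the shared parameter array
            localNetwork.setParams(globalNetwork.params().dup())
        }
    }
}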