Model trained with FP16 on GPU and saved does not load with CPU backend #2191
Comments
What nd4j version are you using? Something like that was fixed quite a while ago.
0.6.0 to train and load.
Thanks. I'll check that once again.
I am also retraining with FP32 to confirm FP16 is the problem in this case.
I'm more than sure it is. However, I'm also sure that stuff was covered with tests. But I'll check everything once again, no worries.
I checked, and a model trained on the GPU without FP16 can be loaded with the CPU backend.
Confirming this as an issue; the fix is applied. The misleading warning is suppressed as well. Thanks for highlighting this one.
… FP16 model parameters cannot be loaded with DL4J 0.6.0 CPU backend (see deeplearning4j/deeplearning4j#2191)
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
We've trained a model with the GPU backend, using FP16. When loading the model with the CPU backend, we get a warning that the precision does not match, followed by an exception (see below).
The warning is expected, but a conversion in the direction FP16 -> FP32 should not lose precision: every FP16 value is exactly representable in FP32, so a widening cast should suffice.
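To illustrate why the widening is lossless, here is a minimal sketch of a manual FP16 -> FP32 bit-level conversion (not DL4J's actual code; the class and method names `Fp16Widen.halfToFloat` are hypothetical). It rebiases the exponent (15 -> 127) and shifts the mantissa; no bits are discarded:

```java
public class Fp16Widen {
    // Decode an IEEE 754 binary16 bit pattern into a float.
    // The conversion is exact: every FP16 value fits in FP32.
    static float halfToFloat(short h) {
        int sign = (h >> 15) & 0x1;
        int exp = (h >> 10) & 0x1F;   // 5-bit exponent, bias 15
        int mant = h & 0x3FF;         // 10-bit mantissa
        if (exp == 0) {
            // Zero or subnormal: value = mant * 2^-24
            return (sign == 1 ? -1f : 1f) * mant * (float) Math.pow(2, -24);
        }
        if (exp == 31) {
            // Infinity or NaN
            return mant == 0
                    ? (sign == 1 ? Float.NEGATIVE_INFINITY : Float.POSITIVE_INFINITY)
                    : Float.NaN;
        }
        // Normal number: rebias exponent (15 -> 127), widen mantissa (10 -> 23 bits)
        int bits = (sign << 31) | ((exp + 112) << 23) | (mant << 13);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        System.out.println(halfToFloat((short) 0x3C00)); // FP16 pattern for 1.0
    }
}
```

(Recent JDKs expose `Float.float16ToFloat` for this, but that method only appeared in Java 20, well after DL4J 0.6.0.)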
However, it looks like the loading code is not aware of the size of the stored floats and reads them as if they were FP32. I have not looked at the code, but the EOFException suggests that each FP32 read consumes two FP16 values, so the end of the stream is reached before all parameters are filled.
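The suspected failure mode can be reproduced in isolation with plain `java.io` streams (a minimal sketch, not DL4J's loader; the class `HalfReadDemo` is hypothetical): write n two-byte FP16 values, then read four-byte floats back, and the stream runs out halfway through:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

public class HalfReadDemo {
    // Write n FP16 values (2 bytes each), then read the same bytes back
    // as FP32 (4 bytes each); returns how many reads succeed before EOF.
    static int readsBeforeEof(int n) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (int i = 0; i < n; i++) {
            out.writeShort(0x3C00); // FP16 bit pattern for 1.0
        }
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bos.toByteArray()));
        int reads = 0;
        try {
            while (reads < n) {
                in.readFloat(); // consumes 4 bytes, i.e. two FP16 values
                reads++;
            }
        } catch (EOFException e) {
            // Stream exhausted before all n "parameters" were read,
            // matching the EOFException in the stack trace below.
        }
        return reads;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("reads before EOF: " + readsBeforeEof(4));
    }
}
```

With 4 stored FP16 values (8 bytes), only 2 FP32 reads succeed before EOF, mirroring the "half the parameters, then EOF" behaviour described above.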
2016/10/18 13:28:08 3859 WARN [main ] [o.n.l.a.b.BaseDataBuffer] o.n.l.a.b.BaseDataBuffer - Loading a data stream with type different from what is set globally. Expect precision loss
java.lang.RuntimeException: java.io.EOFException
at org.nd4j.linalg.api.buffer.BaseDataBuffer.read(BaseDataBuffer.java:1297)
at org.nd4j.linalg.factory.Nd4j.read(Nd4j.java:2223)
at org.deeplearning4j.util.ModelSerializer.restoreMultiLayerNetwork(ModelSerializer.java:183)
at org.deeplearning4j.util.ModelSerializer.restoreMultiLayerNetwork(ModelSerializer.java:266)
at org.campagnelab.dl.model.utils.models.ModelLoader.loadNativeModel(ModelLoader.java:109)
at org.campagnelab.dl.model.utils.models.ModelLoader.loadModel(ModelLoader.java:82)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.nd4j.linalg.api.buffer.BaseDataBuffer.read(BaseDataBuffer.java:1291)
... 11 more