
Model trained with FP16 on GPU and saved does not load with CPU backend #2191

Closed
fac2003 opened this issue Oct 18, 2016 · 8 comments
Labels
Bug Bugs and problems

Comments

@fac2003

fac2003 commented Oct 18, 2016

We've trained a model with the GPU backend, using FP16. When trying to load the model with the CPU backend, we get a warning that the precision does not match, followed by an exception (see below).

The warning is expected, but in the FP16->FP32 direction the conversion should be lossless: every FP16 value is exactly representable in FP32, so a simple cast should suffice.
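For context, the widening really is exact: FP16's 10-bit significand and 5-bit exponent both fit comfortably inside FP32's 23-bit significand and 8-bit exponent. A minimal decoder sketch illustrating this (the class and method names are mine, not part of ND4J; recent JDKs also ship a built-in `Float.float16ToFloat`):

```java
public class Fp16Widening {
    // Decode an IEEE 754 binary16 bit pattern to float. Every FP16 value
    // maps exactly, since FP32's significand and exponent ranges are
    // strict supersets of FP16's.
    static float halfToFloat(short half) {
        int h = half & 0xFFFF;
        int sign = (h >> 15) & 0x1;
        int exp  = (h >> 10) & 0x1F;
        int mant = h & 0x3FF;
        if (exp == 0x1F) {                       // Inf / NaN
            return Float.intBitsToFloat((sign << 31) | 0x7F800000 | (mant << 13));
        }
        if (exp == 0) {                          // zero / subnormal
            float v = mant * 0x1p-24f;           // 2^-24: smallest FP16 subnormal step
            return sign == 1 ? -v : v;
        }
        return Float.intBitsToFloat((sign << 31) | ((exp - 15 + 127) << 23) | (mant << 13));
    }

    public static void main(String[] args) {
        System.out.println(halfToFloat((short) 0x3C00)); // prints 1.0
        System.out.println(halfToFloat((short) 0x3E00)); // prints 1.5
    }
}
```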

However, it looks like the loading code is not aware of the size of the stored floats and reads them as if they were FP32. I have not looked at the code, but the EOFException suggests each 4-byte read consumes two FP16 values, so the end of the stream is reached before all parameters are filled.

2016/10/18 13:28:08 3859 WARN [main ] [o.n.l.a.b.BaseDataBuffer] o.n.l.a.b.BaseDataBuffer - Loading a data stream with type different from what is set globally. Expect precision loss
java.lang.RuntimeException: java.io.EOFException
at org.nd4j.linalg.api.buffer.BaseDataBuffer.read(BaseDataBuffer.java:1297)
at org.nd4j.linalg.factory.Nd4j.read(Nd4j.java:2223)
at org.deeplearning4j.util.ModelSerializer.restoreMultiLayerNetwork(ModelSerializer.java:183)
at org.deeplearning4j.util.ModelSerializer.restoreMultiLayerNetwork(ModelSerializer.java:266)
at org.campagnelab.dl.model.utils.models.ModelLoader.loadNativeModel(ModelLoader.java:109)
at org.campagnelab.dl.model.utils.models.ModelLoader.loadModel(ModelLoader.java:82)

Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.nd4j.linalg.api.buffer.BaseDataBuffer.read(BaseDataBuffer.java:1291)
... 11 more
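The failure mode above can be reproduced in miniature: if a stream holding N two-byte FP16 values is consumed with four-byte FP32 reads, the reader exhausts the bytes halfway through and the remaining `readInt()` throws `EOFException`, matching the bottom of the stack trace. A hedged sketch (all names are mine, not ND4J's):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

public class Fp16EofSketch {
    // Read `expected` parameters assuming 4-byte FP32 elements, as the
    // 0.6.0 loader apparently did, and report how many reads succeed.
    static int readAsFp32(byte[] stream, int expected) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(stream));
        int read = 0;
        try {
            for (int i = 0; i < expected; i++) {
                in.readInt();                    // consumes 4 bytes per parameter
                read++;
            }
        } catch (EOFException e) {
            // Stream exhausted before all parameters were filled.
        }
        return read;
    }

    public static void main(String[] args) throws IOException {
        // Serialize 4 parameters as 2-byte FP16 values: 8 bytes total.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (int i = 0; i < 4; i++) {
            out.writeShort(0x3C00);              // FP16 bit pattern for 1.0
        }
        // 8 bytes / 4 bytes per readInt() = only 2 of 4 parameters read.
        System.out.println(readAsFp32(bos.toByteArray(), 4) + " of 4 parameters read before EOF");
    }
}
```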

@raver119
Contributor

What nd4j version are you using? Something like that was fixed quite a while ago.

@raver119 raver119 self-assigned this Oct 18, 2016
@raver119 raver119 added the Bug Bugs and problems label Oct 18, 2016
@fac2003
Author

fac2003 commented Oct 18, 2016

0.6.0 to train and load.

@raver119
Contributor

Thanks. I'll check that once again.

@fac2003
Author

fac2003 commented Oct 18, 2016

I am also retraining with FP32 to confirm FP16 is the problem in this case.

@raver119
Contributor

I'm more than sure it is. However, I'm also sure that stuff was covered with tests. But I'll check everything once again, no worries.

@fac2003
Author

fac2003 commented Oct 19, 2016

I checked, and a model trained on the GPU without FP16 can be loaded with the CPU backend.

@raver119
Contributor

Confirming this as an issue; a fix has been applied. The misleading warning is suppressed as well.

Thanks for highlighting this one.

fac2003 added a commit to CampagneLaboratory/variationanalysis that referenced this issue Dec 10, 2016
… FP16 model parameters cannot be loaded with DL4J 0.6.0 CPU backend (see deeplearning4j/deeplearning4j#2191)
@lock

lock bot commented Jan 20, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Jan 20, 2019