
Model trained with FP16 on GPU and saved does not load with CPU backend #2191

Closed
fac2003 opened this issue Oct 18, 2016 · 8 comments
Labels
Bug Bugs and problems

Comments

@fac2003

fac2003 commented Oct 18, 2016

We've trained a model with the GPU backend, using FP16. When trying to load the model with the CPU backend, we get a warning that the precision does not match, followed by an exception (see below).

The warning is expected, but in the FP16->FP32 direction the conversion should be lossless: every FP16 value is exactly representable in FP32, so a simple cast should suffice.
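For context, the widening really is exact: FP16's 10-bit significand and 5-bit exponent both fit comfortably inside FP32's 23-bit significand and 8-bit exponent. A minimal decoder sketch illustrating this (the class and method names are mine, not part of ND4J; recent JDKs also ship a built-in `Float.float16ToFloat`):

```java
public class Fp16Widening {
    // Decode an IEEE 754 binary16 bit pattern to float. Every FP16 value
    // maps exactly, since FP32's significand and exponent ranges are
    // strict supersets of FP16's.
    static float halfToFloat(short half) {
        int h = half & 0xFFFF;
        int sign = (h >> 15) & 0x1;
        int exp  = (h >> 10) & 0x1F;
        int mant = h & 0x3FF;
        if (exp == 0x1F) {                       // Inf / NaN
            return Float.intBitsToFloat((sign << 31) | 0x7F800000 | (mant << 13));
        }
        if (exp == 0) {                          // zero / subnormal
            float v = mant * 0x1p-24f;           // 2^-24: smallest FP16 subnormal step
            return sign == 1 ? -v : v;
        }
        return Float.intBitsToFloat((sign << 31) | ((exp - 15 + 127) << 23) | (mant << 13));
    }

    public static void main(String[] args) {
        System.out.println(halfToFloat((short) 0x3C00)); // prints 1.0
        System.out.println(halfToFloat((short) 0x3E00)); // prints 1.5
    }
}
```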

However, it looks like the loading code is not aware of the size of the stored floats and reads them as if they were FP32. I have not looked at the code, but the EOFException suggests each 4-byte read consumes two FP16 values, so the end of the stream is reached before all parameters are filled.

2016/10/18 13:28:08 3859 WARN [main ] [o.n.l.a.b.BaseDataBuffer] o.n.l.a.b.BaseDataBuffer - Loading a data stream with type different from what is set globally. Expect precision loss
java.lang.RuntimeException: java.io.EOFException
at org.nd4j.linalg.api.buffer.BaseDataBuffer.read(BaseDataBuffer.java:1297)
at org.nd4j.linalg.factory.Nd4j.read(Nd4j.java:2223)
at org.deeplearning4j.util.ModelSerializer.restoreMultiLayerNetwork(ModelSerializer.java:183)
at org.deeplearning4j.util.ModelSerializer.restoreMultiLayerNetwork(ModelSerializer.java:266)
at org.campagnelab.dl.model.utils.models.ModelLoader.loadNativeModel(ModelLoader.java:109)
at org.campagnelab.dl.model.utils.models.ModelLoader.loadModel(ModelLoader.java:82)

Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.nd4j.linalg.api.buffer.BaseDataBuffer.read(BaseDataBuffer.java:1291)
... 11 more
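The failure mode above can be reproduced in miniature: if a stream holding N two-byte FP16 values is consumed with four-byte FP32 reads, the reader exhausts the bytes halfway through and the remaining `readInt()` throws `EOFException`, matching the bottom of the stack trace. A hedged sketch (all names are mine, not ND4J's):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

public class Fp16EofSketch {
    // Read `expected` parameters assuming 4-byte FP32 elements, as the
    // 0.6.0 loader apparently did, and report how many reads succeed.
    static int readAsFp32(byte[] stream, int expected) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(stream));
        int read = 0;
        try {
            for (int i = 0; i < expected; i++) {
                in.readInt();                    // consumes 4 bytes per parameter
                read++;
            }
        } catch (EOFException e) {
            // Stream exhausted before all parameters were filled.
        }
        return read;
    }

    public static void main(String[] args) throws IOException {
        // Serialize 4 parameters as 2-byte FP16 values: 8 bytes total.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (int i = 0; i < 4; i++) {
            out.writeShort(0x3C00);              // FP16 bit pattern for 1.0
        }
        // 8 bytes / 4 bytes per readInt() = only 2 of 4 parameters read.
        System.out.println(readAsFp32(bos.toByteArray(), 4) + " of 4 parameters read before EOF");
    }
}
```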

@raver119
Contributor

What nd4j version are you using? Something like that was fixed quite a while ago.

@raver119 raver119 self-assigned this Oct 18, 2016
@raver119 raver119 added the Bug Bugs and problems label Oct 18, 2016
@fac2003
Author

fac2003 commented Oct 18, 2016

0.6.0 to train and load.

@raver119
Contributor

Thanks. I'll check that once again.

@fac2003
Author

fac2003 commented Oct 18, 2016

I am also retraining with FP32 to confirm FP16 is the problem in this case.

@raver119
Contributor

I'm more than sure it is. However, I'm also sure that stuff was covered with tests. But I'll check everything once again, no worries.

@fac2003
Author

fac2003 commented Oct 19, 2016

I checked, and a model trained on the GPU without FP16 can be loaded with the CPU backend.

@raver119
Contributor

Confirming this as an issue; a fix has been applied. The misleading warning is suppressed as well.

Thanks for highlighting this one.

fac2003 added a commit to CampagneLaboratory/variationanalysis that referenced this issue Dec 10, 2016
… FP16 model parameters cannot be loaded with DL4J 0.6.0 CPU backend (see deeplearning4j/deeplearning4j#2191)
@lock

lock bot commented Jan 20, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Jan 20, 2019