NullPointerException in CudaDirectProvider.malloc when trying to create a diagonal matrix #1335
I'm trying to convert a vector into a diagonal matrix using the static method Nd4j.diag, and getting a NPE in the malloc method of CudaDirectProvider (stack trace below).
The input vector is an NDArray created from a Java primitive double.
I'm using cuDNN 5.1, CUDA 7.5, and a GTX-960.
I've set the data type to "double" via DataTypeUtil.setDTypeForContext(DataBuffer.Type.DOUBLE). (Setting it to half or float leads to a different error: cannot cast ShortPointer to FloatPointer in JcublasLevel2.java at line 52.)
The code is running in a loop that traverses multiple member records. The first few dozen work fine, then one throws this exception. The underlying data for this record doesn't look any different from the other records. If I remove this record, the code will progress over a few more and then throw this exception again.
I've successfully reproduced this on an EC2 instance running the recently-announced DL4J AMI as well as my local environment.
Note this is not a neural network - just linear algebra. I'm only using ND4J - not DL4J or DataVec etc.
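For readers following along, here is a minimal plain-Java sketch of what a diag operation computes: a length-n vector becomes an n x n matrix with the vector on the main diagonal. It is only an illustration of the result's shape, not ND4J's implementation, and the class and method names (DiagSketch, diag) are made up for this example.

```java
import java.util.Arrays;

// Plain-Java illustration of what a "diag" operation computes.
// This is NOT ND4J's implementation; it only shows the shape of the result.
class DiagSketch {

    // Builds an n x n matrix with the vector on the main diagonal and zeros elsewhere.
    static double[][] diag(double[] vector) {
        int n = vector.length;
        double[][] matrix = new double[n][n]; // Java zero-initializes new arrays
        for (int i = 0; i < n; i++) {
            matrix[i][i] = vector[i];
        }
        return matrix;
    }

    public static void main(String[] args) {
        double[][] m = diag(new double[] {1.0, 2.0, 3.0});
        for (double[] row : m) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The point to notice is that the output grows with the square of the input length, which matters for the memory discussion further down the thread.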
This particular exception comes from an OOM condition and is already fixed on current master.
But could you please provide the cast exception you mentioned?
10 Oct 2016, 4:02, user "jgainesau" email@example.com
Oh I see, so what I'm seeing is actually an OutOfMemoryError manifesting itself in a funny way? Ok, seems plausible, OOMs can do weird things sometimes. I'll try to reduce the memory consumption.
But ... it is odd that the first 40-odd passes through the loop worked fine. The data structures on each pass are all the same size. I suppose if I've got a massive memory leak somewhere that could be causing it.
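A rough way to see how quickly a dense diagonal matrix adds up is to multiply it out: n x n elements times the element width (2 bytes for half, 4 for float, 8 for double). The sketch below does only that arithmetic. The vector length is a hypothetical value, not one taken from this thread, and the class name is invented for the example.

```java
// Back-of-the-envelope size of the dense n x n matrix built from a length-n vector.
class DiagMemoryEstimate {

    // Bytes needed for a dense n x n matrix at the given element width.
    static long denseBytes(long n, int bytesPerElement) {
        return n * n * bytesPerElement;
    }

    public static void main(String[] args) {
        long n = 10_000; // hypothetical vector length, not taken from the thread
        System.out.println("half   (2 bytes/element): " + denseBytes(n, 2) + " bytes");
        System.out.println("float  (4 bytes/element): " + denseBytes(n, 4) + " bytes");
        System.out.println("double (8 bytes/element): " + denseBytes(n, 8) + " bytes");
    }
}
```

If buffers from earlier passes are not released promptly, equal-sized allocations on every pass would still accumulate, which might explain why the first few dozen passes succeed before one fails.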
Ok, regarding the class cast exception. To trigger this, I run exactly the same code, but with one change:
Early on, instead of this:
I have this:
Here is the resulting stack trace:
In this case the error arises not in the diag call but in a later call to mmul, which is doing the dot product of two matrices filled with double values.
I thought there might be some relationship between "ShortPointer" and the "HALF" DataBuffer type, and since the matrices are filled with doubles, I thought changing "HALF" to "DOUBLE" might fix the problem. It did - but it seems that perhaps all I accomplished was to get the code to run out of memory :-).
Thanks for taking a look.
Well, it's an NPE like the first version, but it happens right away like the second one:
It looks like the same OOM issue. Can I see the source code that reproduces
PS. This issue is fixed on current master, so you can build from sources if
10 Oct 2016, 12:24, user "jgainesau" firstname.lastname@example.org
Sure thing. It's Groovy.
Here's the loop that calculates latent factors for each member. The expression
This method computes the CX variable for each member. It creates a Java
It wouldn't surprise me in the slightest if this code had memory issues. It's a first pass, trying to get a feel for using Nd4j in a non-neural-network setting, as a linear algebra tool.
Having said that, it's still interesting to me that setting the context data type to
Ok so I reviewed my comment and decided to check to see if the underlying issue is OOM by the very simple expedient of taking a small slice of the full dataset. And guess what - it works fine!
So I'd say the underlying issue really truly is OOM.
And I'd better get to work optimizing that code ;-).
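The slice check described above can be sketched in plain Java as follows. This is a generic illustration, not the Groovy code from this thread; the helper name firstN and the sample record IDs are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Takes a small leading slice of a dataset so the same processing loop
// can be re-run on less data to test whether a failure is memory-related.
class SliceCheck {

    // Returns at most `limit` leading records as an independent copy.
    static <T> List<T> firstN(List<T> records, int limit) {
        int end = Math.min(Math.max(limit, 0), records.size());
        return new ArrayList<>(records.subList(0, end));
    }

    public static void main(String[] args) {
        List<Integer> memberIds = Arrays.asList(101, 102, 103, 104, 105); // made-up IDs
        List<Integer> slice = firstN(memberIds, 2);
        System.out.println("Processing " + slice.size() + " of " + memberIds.size() + " records: " + slice);
    }
}
```

If the loop succeeds on the slice but fails on the full set with identical per-record logic, that points toward resource exhaustion rather than a bad record.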
Yeah, that's OOM without any doubt.
10 Oct 2016, 13:48, user "jgainesau" email@example.com