This repository has been archived by the owner. It is now read-only.

NullPointerException in CudaDirectProvider.malloc when trying to create a diagonal matrix #1335

Closed
jgainesau opened this Issue Oct 10, 2016 · 10 comments


jgainesau commented Oct 10, 2016

I'm trying to convert a vector into a diagonal matrix using the static method Nd4j.diag, and getting an NPE in the malloc method of CudaDirectProvider (stack trace below).

The input vector is an NDArray created from a Java primitive double[].

I'm using cuDNN 5.1, CUDA 7.5, and a GTX 960.

I've set the data type to "double" via DataTypeUtil.setDTypeForContext(DataBuffer.Type.DOUBLE). (Half or float leads to a different error: cannot cast ShortPointer to FloatPointer in JcublasLevel2.java at line 52.)

The code is running in a loop that traverses multiple member records. The first few dozen work fine, then one throws this exception. The underlying data for this record doesn't look any different from the other records. If I remove this record, the code will progress over a few more and then throw this exception again.

I've successfully reproduced this on an EC2 instance running the recently-announced DL4J AMI, as well as in my local environment.

Note this is not a neural network - just linear algebra. I'm only using ND4J - not DL4J or DataVec etc.

2016-10-09 18:02:10,987 INFO c.p.r.OptimizeLatentFactors - Member 213830 (49)
2016-10-09 18:02:13,219 ERROR j.l.Throwable - Exception in thread "main" java.lang.NullPointerException
2016-10-09 18:02:13,220 ERROR j.l.Throwable - at org.nd4j.jita.memory.impl.CudaDirectProvider.malloc(CudaDirectProvider.java:89)
2016-10-09 18:02:13,220 ERROR j.l.Throwable - at org.nd4j.jita.memory.impl.CudaCachingZeroProvider.malloc(CudaCachingZeroProvider.java:116)
2016-10-09 18:02:13,220 ERROR j.l.Throwable - at org.nd4j.jita.memory.impl.CudaFullCachingProvider.malloc(CudaFullCachingProvider.java:76)
2016-10-09 18:02:13,220 ERROR j.l.Throwable - at org.nd4j.jita.handler.impl.CudaZeroHandler.alloc(CudaZeroHandler.java:253)
2016-10-09 18:02:13,220 ERROR j.l.Throwable - at org.nd4j.jita.allocator.impl.AtomicAllocator.allocateMemory(AtomicAllocator.java:381)
2016-10-09 18:02:13,220 ERROR j.l.Throwable - at org.nd4j.jita.allocator.impl.AtomicAllocator.allocateMemory(AtomicAllocator.java:338)
2016-10-09 18:02:13,220 ERROR j.l.Throwable - at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.&lt;init&gt;(BaseCudaDataBuffer.java:144)
2016-10-09 18:02:13,220 ERROR j.l.Throwable - at org.nd4j.linalg.jcublas.buffer.CudaDoubleDataBuffer.&lt;init&gt;(CudaDoubleDataBuffer.java:59)
2016-10-09 18:02:13,220 ERROR j.l.Throwable - at org.nd4j.linalg.jcublas.buffer.factory.CudaDataBufferFactory.createDouble(CudaDataBufferFactory.java:241)
2016-10-09 18:02:13,220 ERROR j.l.Throwable - at org.nd4j.linalg.factory.Nd4j.createBuffer(Nd4j.java:1282)
2016-10-09 18:02:13,220 ERROR j.l.Throwable - at org.nd4j.linalg.factory.Nd4j.createBuffer(Nd4j.java:1252)
2016-10-09 18:02:13,220 ERROR j.l.Throwable - at org.nd4j.linalg.api.ndarray.BaseNDArray.&lt;init&gt;(BaseNDArray.java:249)
2016-10-09 18:02:13,221 ERROR j.l.Throwable - at org.nd4j.linalg.api.ndarray.BaseNDArray.&lt;init&gt;(BaseNDArray.java:286)
2016-10-09 18:02:13,221 ERROR j.l.Throwable - at org.nd4j.linalg.api.ndarray.BaseNDArray.&lt;init&gt;(BaseNDArray.java:563)
2016-10-09 18:02:13,221 ERROR j.l.Throwable - at org.nd4j.linalg.jcublas.JCublasNDArray.&lt;init&gt;(JCublasNDArray.java:258)
2016-10-09 18:02:13,221 ERROR j.l.Throwable - at org.nd4j.linalg.jcublas.JCublasNDArrayFactory.create(JCublasNDArrayFactory.java:224)
2016-10-09 18:02:13,221 ERROR j.l.Throwable - at org.nd4j.linalg.factory.Nd4j.create(Nd4j.java:4369)
2016-10-09 18:02:13,221 ERROR j.l.Throwable - at org.nd4j.linalg.factory.Nd4j.create(Nd4j.java:4331)
2016-10-09 18:02:13,221 ERROR j.l.Throwable - at org.nd4j.linalg.factory.Nd4j.create(Nd4j.java:3584)
2016-10-09 18:02:13,221 ERROR j.l.Throwable - at org.nd4j.linalg.factory.Nd4j.diag(Nd4j.java:2506)
2016-10-09 18:02:13,221 ERROR j.l.Throwable - at org.nd4j.linalg.factory.Nd4j.diag(Nd4j.java:2550)
2016-10-09 18:02:13,221 ERROR j.l.Throwable - at org.nd4j.linalg.factory.Nd4j$diag$2.callStatic(Unknown Source)
2016-10-09 18:02:13,221 ERROR j.l.Throwable - at c.p.r.OptimizeLatentFactors.computeMemberConfidenceMatrix(OptimizeLatentFactors.groovy:146)

Contributor

raver119 commented Oct 10, 2016

This particular exception comes from OOM and is already fixed on current master.

But could you please provide the cast exception you mentioned?

jgainesau commented Oct 10, 2016

Oh I see, so what I'm seeing is actually an OutOfMemoryError manifesting itself in a funny way? Ok, seems plausible, OOMs can do weird things sometimes. I'll try to reduce the memory consumption.

But ... it is odd that the first 40-odd passes through the loop worked fine. The data structures on each pass are all the same size. I suppose if I've got a massive memory leak somewhere that could be causing it.

Ok, regarding the class cast exception. To trigger this, I run exactly the same code, but with one change:

Early on, instead of this:

DataTypeUtil.setDTypeForContext(DataBuffer.Type.DOUBLE)

I have this:

DataTypeUtil.setDTypeForContext(DataBuffer.Type.HALF)

Here is the resulting stack trace:

2016-10-10 17:10:41,788 ERROR j.l.Throwable - Exception in thread "main" java.lang.ClassCastException: org.bytedeco.javacpp.ShortPointer cannot be cast to org.bytedeco.javacpp.FloatPointer
2016-10-10 17:10:41,788 ERROR j.l.Throwable - at org.nd4j.linalg.jcublas.blas.JcublasLevel2.sgemv(JcublasLevel2.java:52)
2016-10-10 17:10:41,788 ERROR j.l.Throwable - at org.nd4j.linalg.api.blas.impl.BaseLevel2.gemv(BaseLevel2.java:51)
2016-10-10 17:10:41,788 ERROR j.l.Throwable - at org.nd4j.linalg.api.ndarray.BaseNDArray.mmuli(BaseNDArray.java:2697)
2016-10-10 17:10:41,788 ERROR j.l.Throwable - at org.nd4j.linalg.api.ndarray.BaseNDArray.mmul(BaseNDArray.java:2501)
...
2016-10-10 17:10:41,790 ERROR j.l.Throwable - at c.p.r.OptimizeLatentFactors.optimizeLatentFactors(OptimizeLatentFactors.groovy:94)

In this case the error arises not in the diag call but in a later call to mmul, which is computing the product of two matrices filled with double values.

I thought there might be some relationship between "ShortPointer" and the "HALF" DataBuffer type, and since the matrices are filled with doubles, I thought changing "HALF" to "DOUBLE" might fix the problem. It did - but it seems that perhaps all I accomplished was to get the code to run out of memory :-).
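There is indeed a direct relationship: Java has no primitive 16-bit float type, so FP16 buffers are backed by 16-bit shorts, which is presumably why the HALF data type surfaces as a ShortPointer. A minimal sketch of the half-precision bit layout (normal values only, mantissa truncated rather than rounded; the helper names are mine, not ND4J's):

```java
public class HalfBits {
    // IEEE 754 binary16: 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits.
    static short floatToHalf(float f) {
        int bits = Float.floatToIntBits(f);
        int sign = (bits >>> 16) & 0x8000;            // move sign to bit 15
        int exp  = ((bits >>> 23) & 0xFF) - 127 + 15; // re-bias exponent
        int mant = (bits >>> 13) & 0x3FF;             // keep top 10 mantissa bits
        return (short) (sign | (exp << 10) | mant);
    }

    static float halfToFloat(short h) {
        int sign = (h & 0x8000) << 16;
        int exp  = ((h >>> 10) & 0x1F) - 15 + 127;
        int mant = (h & 0x3FF) << 13;
        return Float.intBitsToFloat(sign | (exp << 23) | mant);
    }

    public static void main(String[] args) {
        System.out.printf("1.0f as half bits: 0x%04X%n", floatToHalf(1.0f)); // 0x3C00
        System.out.println(halfToFloat(floatToHalf(2.5f)));                  // 2.5
    }
}
```

So the ShortPointer is just the storage view of those 16-bit patterns; the cast fails because sgemv expects a FloatPointer.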

Thanks for taking a look.

Contributor

raver119 commented Oct 10, 2016

Ah, right. cuBLAS does not provide Hgemv for halfs.

What about DataTypeUtil.setDTypeForContext(DataBuffer.Type.FLOAT)?

jgainesau commented Oct 10, 2016

Well, it's an NPE like the first version, but it happens right away like the second one:

2016-10-10 18:18:18,651 INFO c.p.r.OptimizeLatentFactors - Member 213278 (0)
2016-10-10 18:18:23,125 ERROR j.l.Throwable - Exception in thread "main" java.lang.NullPointerException
2016-10-10 18:18:23,125 ERROR j.l.Throwable - at org.nd4j.jita.memory.impl.CudaDirectProvider.malloc(CudaDirectProvider.java:89)
2016-10-10 18:18:23,125 ERROR j.l.Throwable - at org.nd4j.jita.memory.impl.CudaCachingZeroProvider.malloc(CudaCachingZeroProvider.java:116)
2016-10-10 18:18:23,126 ERROR j.l.Throwable - at org.nd4j.jita.memory.impl.CudaFullCachingProvider.malloc(CudaFullCachingProvider.java:74)
2016-10-10 18:18:23,126 ERROR j.l.Throwable - at org.nd4j.jita.handler.impl.CudaZeroHandler.alloc(CudaZeroHandler.java:253)
2016-10-10 18:18:23,126 ERROR j.l.Throwable - at org.nd4j.jita.allocator.impl.AtomicAllocator.allocateMemory(AtomicAllocator.java:381)
2016-10-10 18:18:23,126 ERROR j.l.Throwable - at org.nd4j.jita.allocator.impl.AtomicAllocator.allocateMemory(AtomicAllocator.java:338)
2016-10-10 18:18:23,126 ERROR j.l.Throwable - at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.&lt;init&gt;(BaseCudaDataBuffer.java:144)
2016-10-10 18:18:23,126 ERROR j.l.Throwable - at org.nd4j.linalg.jcublas.buffer.CudaFloatDataBuffer.&lt;init&gt;(CudaFloatDataBuffer.java:59)
2016-10-10 18:18:23,126 ERROR j.l.Throwable - at org.nd4j.linalg.jcublas.buffer.factory.CudaDataBufferFactory.createFloat(CudaDataBufferFactory.java:251)
2016-10-10 18:18:23,126 ERROR j.l.Throwable - at org.nd4j.linalg.factory.Nd4j.createBuffer(Nd4j.java:1277)
2016-10-10 18:18:23,126 ERROR j.l.Throwable - at org.nd4j.linalg.api.ndarray.BaseNDArray.&lt;init&gt;(BaseNDArray.java:262)
2016-10-10 18:18:23,126 ERROR j.l.Throwable - at org.nd4j.linalg.jcublas.JCublasNDArray.&lt;init&gt;(JCublasNDArray.java:114)
2016-10-10 18:18:23,127 ERROR j.l.Throwable - at org.nd4j.linalg.jcublas.JCublasNDArrayFactory.createUninitialized(JCublasNDArrayFactory.java:229)
2016-10-10 18:18:23,127 ERROR j.l.Throwable - at org.nd4j.linalg.factory.Nd4j.createUninitialized(Nd4j.java:4391)
2016-10-10 18:18:23,127 ERROR j.l.Throwable - at org.nd4j.linalg.api.shape.Shape.toOffsetZeroCopyHelper(Shape.java:152)
2016-10-10 18:18:23,127 ERROR j.l.Throwable - at org.nd4j.linalg.api.shape.Shape.toOffsetZeroCopy(Shape.java:108)
2016-10-10 18:18:23,127 ERROR j.l.Throwable - at org.nd4j.linalg.api.ndarray.BaseNDArray.dup(BaseNDArray.java:1498)
2016-10-10 18:18:23,127 ERROR j.l.Throwable - at org.nd4j.linalg.jcublas.JCublasNDArray.dup(JCublasNDArray.java:407)
2016-10-10 18:18:23,127 ERROR j.l.Throwable - at org.nd4j.linalg.api.ndarray.BaseNDArray.sub(BaseNDArray.java:2577)
2016-10-10 18:18:23,127 ERROR j.l.Throwable - at
2016-10-10 18:18:23,128 ERROR j.l.Throwable - at
...
c.p.r.OptimizeLatentFactors$_optimizeLatentFactors_closure4.doCall(OptimizeLatentFactors.groovy:98)

jgainesau commented Oct 10, 2016

... and the Nd4j operation is a subtraction.

Contributor

raver119 commented Oct 10, 2016

It looks like the same OOM issue. Can I see the source code that reproduces this issue?

PS. This issue is fixed on current master, so you can build from source if you want. But if that's a real shortage of memory, things will be slow, because host memory will be used in such cases.


jgainesau commented Oct 10, 2016

Sure thing. It's Groovy.

Here's the loop that calculates latent factors for each member. The expression CX.sub(productsI) is what throws the exception. members and products are lists of ints, alpha is just a double that provides some weighting, and productsI is the result of Nd4j.eye(numProducts):

members.eachWithIndex { member, i ->
    log.info "Member $member ($i)"
    final CX = computeMemberConfidenceMatrix(member, products, rentalRecords, alpha)
    final q1 = CX.sub(productsI)
    ...
}

This method computes the CX variable for each member. It builds a Java float[] representing the member's rental (or not) of each product, wraps it in an NDArray (vector), and then takes the diag of that vector:

INDArray computeMemberConfidenceMatrix(int member, SortedSet<Integer> products, List<List<Integer>> rentalRecords,
                                     double alpha) {
    final Map scoredRentals = rentalRecords.findAll { it[0] == member }.groupBy { it[1] }.collectEntries { productId, recs ->
        [(productId): recs*.getAt(3).sum()]
    }
    final float[] confidence = products.collect { product ->
        (float) (1 + (alpha * (scoredRentals.containsKey(product) ? scoredRentals.get(product) : 0)))
    }
    return Nd4j.diag(Nd4j.create(confidence))
}
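As an aside: since CX is diagonal by construction, the dense matrix may not need to be materialized at all. Left-multiplying by diag(c) is the same as scaling row i of the other operand by c[i], and CX.sub(productsI) of a diagonal CX is just the diagonal with each entry reduced by 1. A plain-Java sketch of the multiplication identity (illustrative helper names, not ND4J API):

```java
import java.util.Arrays;

public class DiagIdentity {
    // Dense product diag(c) * M, building the diagonal entries implicitly.
    static double[][] diagMatMul(double[] c, double[][] m) {
        int n = c.length, cols = m[0].length;
        double[][] out = new double[n][cols];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                double d = (i == k) ? c[i] : 0.0;   // entry (i, k) of diag(c)
                for (int j = 0; j < cols; j++)
                    out[i][j] += d * m[k][j];
            }
        return out;
    }

    // Equivalent row scaling: O(n * cols) work, and no n x n matrix is ever built.
    static double[][] rowScale(double[] c, double[][] m) {
        double[][] out = new double[c.length][];
        for (int i = 0; i < c.length; i++) {
            out[i] = new double[m[i].length];
            for (int j = 0; j < m[i].length; j++)
                out[i][j] = c[i] * m[i][j];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] c = {2.0, 3.0};
        double[][] m = {{1.0, 4.0}, {5.0, 6.0}};
        System.out.println(Arrays.deepEquals(diagMatMul(c, m), rowScale(c, m))); // true
    }
}
```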

The confidence variable has about 20k elements, so that diag (and productsI for that matter) is probably pretty big.
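A quick back-of-the-envelope check supports that. Assuming roughly 20,000 products (my reading of the numbers above), a single dense 20k x 20k matrix of doubles works out to:

```java
public class DiagMemoryEstimate {
    public static void main(String[] args) {
        long n = 20_000;           // approximate length of the confidence vector
        long bytes = n * n * 8;    // dense n x n matrix of 8-byte doubles
        System.out.println(bytes); // 3200000000, i.e. about 3.2 GB
    }
}
```

With CX, productsI, and q1 each in that neighborhood, the working set dwarfs a GTX 960's device memory (2 GB on the common variant, 4 GB at most), so heavy reliance on host memory and eventual exhaustion after some number of cached allocations seems plausible.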

It wouldn't surprise me in the slightest if this code had memory issues. It's a first pass, trying to get a feel for using Nd4j in a non-neural-network setting, as a linear algebra tool.

Having said that, it's still interesting to me that setting the context data type to DataBuffer.Type.DOUBLE - which surely is the most memory-hungry way to go? - does actually seem to work for a few dozen complete passes through that main loop.

jgainesau commented Oct 10, 2016

Ok so I reviewed my comment and decided to check to see if the underlying issue is OOM by the very simple expedient of taking a small slice of the full dataset. And guess what - it works fine!

So I'd say the underlying issue really truly is OOM.

And I'd better get to work optimizing that code ;-).

Contributor

raver119 commented Oct 10, 2016

Yea, that's OOM without any doubt.


@raver119 raver119 added the wontfix label Oct 10, 2016

Contributor

raver119 commented Oct 10, 2016

Closing this now, since the original issue is already fixed on current master.

@raver119 raver119 closed this Oct 10, 2016
