This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Windows GPU accuracy extremely bad #1228

Closed
jonathanponce opened this issue Jan 9, 2016 · 115 comments

@jonathanponce

Hey, I'm quite new to mxnet. I followed the installation instructions and succeeded in installing it on Windows 8.1 64-bit. I then ran train_mnist.py --network lenet without a problem; it was quite slow, but the accuracy at the end is good, around 99.2%. But when I run it with --network lenet --gpus 0 to use my GPU, it's definitely a lot faster, yet the accuracy never gets above 10%, which is terrible. There must be something wrong; theoretically it should be the same accuracy, right? I installed CUDA 7.5 and also extracted cuDNN v3 just as indicated, and everything runs without a problem except that the accuracy is terrible. I'm running on a laptop with an NVIDIA 660M graphics card, which has compute capability 3.0.

After running the file I get Train-accuracy=0.098825

@piiswrong
Contributor

Here is my output from train_mnist.py:

2016-01-09 12:48:47,622 Node[0] start with arguments Namespace(batch_size=128, data_dir='mnist/', gpus=None, kv_store='local', load_epoch=None, lr=0.1, lr_factor=1, lr_factor_epoch=1, model_prefix=None, network='mlp', num_epochs=10, num_examples=60000)
[12:48:51] src/io/iter_mnist.cc:91: MNISTIter: load 60000 images, shuffle=1, shape=(128,784)
[12:48:52] src/io/iter_mnist.cc:91: MNISTIter: load 10000 images, shuffle=1, shape=(128,784)
2016-01-09 12:48:52,053 Node[0] Start training with [cpu(0)]
2016-01-09 12:48:53,105 Node[0] Epoch[0] Batch [50] Speed: 6447.52 samples/sec  Train-accuracy=0.686719
2016-01-09 12:48:53,829 Node[0] Epoch[0] Batch [100]    Speed: 8836.63 samples/sec  Train-accuracy=0.793828
2016-01-09 12:48:54,660 Node[0] Epoch[0] Batch [150]    Speed: 7707.90 samples/sec  Train-accuracy=0.836302
2016-01-09 12:48:55,366 Node[0] Epoch[0] Batch [200]    Speed: 9064.13 samples/sec  Train-accuracy=0.858555
2016-01-09 12:48:56,192 Node[0] Epoch[0] Batch [250]    Speed: 7749.72 samples/sec  Train-accuracy=0.873969
2016-01-09 12:48:57,027 Node[0] Epoch[0] Batch [300]    Speed: 7662.28 samples/sec  Train-accuracy=0.885052
2016-01-09 12:48:57,808 Node[0] Epoch[0] Batch [350]    Speed: 8206.58 samples/sec  Train-accuracy=0.893951
2016-01-09 12:48:58,552 Node[0] Epoch[0] Batch [400]    Speed: 8606.22 samples/sec  Train-accuracy=0.900723
2016-01-09 12:48:59,377 Node[0] Epoch[0] Batch [450]    Speed: 7758.36 samples/sec  Train-accuracy=0.906563

It looks fine. Did you try pulling the newest changes and running make clean && make?

@jonathanponce
Author

Here is mine:


C:\mxnet\nocudnn\python\image-classification>D:\Python27\python.exe train_mnist.py --network lenet --gpus 0
2016-01-09 20:52:15,706 Node[0] start with arguments Namespace(batch_size=128, data_dir='mnist/', gpus='0', kv_store='local', load_epoch=None, lr=0.1, lr_factor=1, lr_factor_epoch=1, model_prefix=None, network='lenet', num_epochs=10, num_examples=60000)
[20:52:17] D:\chhong\mxnet\src\io\iter_mnist.cc:94: MNISTIter: load 60000 images, shuffle=1, shape=(128, 1, 28, 28)
[20:52:18] D:\chhong\mxnet\src\io\iter_mnist.cc:94: MNISTIter: load 10000 images, shuffle=1, shape=(128, 1, 28, 28)
2016-01-09 20:52:18,315 Node[0] Start training with [gpu(0)]
2016-01-09 20:52:20,598 Node[0] Epoch[0] Batch [50]     Speed: 4719.76 samples/sec      Train-accuracy=0.096719
2016-01-09 20:52:21,969 Node[0] Epoch[0] Batch [100]    Speed: 4668.13 samples/sec      Train-accuracy=0.098203
2016-01-09 20:52:23,334 Node[0] Epoch[0] Batch [150]    Speed: 4688.64 samples/sec      Train-accuracy=0.100625
2016-01-09 20:52:24,688 Node[0] Epoch[0] Batch [200]    Speed: 4723.25 samples/sec      Train-accuracy=0.100039
2016-01-09 20:52:26,042 Node[0] Epoch[0] Batch [250]    Speed: 4726.74 samples/sec      Train-accuracy=0.098344
2016-01-09 20:52:27,424 Node[0] Epoch[0] Batch [300]    Speed: 4634.32 samples/sec      Train-accuracy=0.099635
2016-01-09 20:52:28,793 Node[0] Epoch[0] Batch [350]    Speed: 4671.53 samples/sec      Train-accuracy=0.099955

As you can see, the accuracy stays in the 9-10% range, and even after the 10 epochs it remains the same. As for the make part, I downloaded and installed the pre-built GPU package from here:
https://github.com/dmlc/mxnet/releases

@piiswrong
Contributor

My output with exactly the same command on Linux:

python train_mnist.py --network lenet --gpus 0
2016-01-09 14:18:41,245 Node[0] start with arguments Namespace(batch_size=128, data_dir='mnist/', gpus='0', kv_store='local', load_epoch=None, lr=0.1, lr_factor=1, lr_factor_epoch=1, model_prefix=None, network='lenet', num_epochs=10, num_examples=60000)
[14:18:43] src/io/iter_mnist.cc:94: MNISTIter: load 60000 images, shuffle=1, shape=(128, 1, 28, 28)
[14:18:43] src/io/iter_mnist.cc:94: MNISTIter: load 10000 images, shuffle=1, shape=(128, 1, 28, 28)
2016-01-09 14:18:43,402 Node[0] Start training with [gpu(0)]
2016-01-09 14:18:46,866 Node[0] Epoch[0] Batch [50] Speed: 2515.84 samples/sec  Train-accuracy=0.810000
2016-01-09 14:18:49,499 Node[0] Epoch[0] Batch [100]    Speed: 2431.10 samples/sec  Train-accuracy=0.876484
2016-01-09 14:18:52,040 Node[0] Epoch[0] Batch [150]    Speed: 2518.40 samples/sec  Train-accuracy=0.903073
2016-01-09 14:18:54,563 Node[0] Epoch[0] Batch [200]    Speed: 2537.25 samples/sec  Train-accuracy=0.918750
2016-01-09 14:18:57,251 Node[0] Epoch[0] Batch [250]    Speed: 2380.75 samples/sec  Train-accuracy=0.928750
2016-01-09 14:18:59,741 Node[0] Epoch[0] Batch [300]    Speed: 2570.31 samples/sec  Train-accuracy=0.936120
2016-01-09 14:19:02,343 Node[0] Epoch[0] Batch [350]    Speed: 2459.97 samples/sec  Train-accuracy=0.941897
2016-01-09 14:19:04,880 Node[0] Epoch[0] Batch [400]    Speed: 2523.58 samples/sec  Train-accuracy=0.946660
2016-01-09 14:19:07,560 Node[0] Epoch[0] Batch [450]    Speed: 2387.78 samples/sec  Train-accuracy=0.950122

This seems to be a Windows-specific issue. @hjk41 Could you look into it?

Meanwhile, @jonathanponce, try using a monitor (example in example/python-howto/monitor_weights.py) to check the internal weights and outputs and see if anything is wrong.
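
For reference, wiring up the monitor looks roughly like this (a minimal, untested sketch based on the old FeedForward API; the tiny stand-in network and random data are only there to make it self-contained, the real script would pass its own lenet symbol and MNIST iterator):

import numpy as np
import mxnet as mx

def norm_stat(d):
    # average magnitude of each array; a healthy run prints small non-zero
    # values, while a broken GPU build prints mostly 0.0
    return mx.nd.norm(d) / np.sqrt(np.prod(d.shape))

mon = mx.mon.Monitor(1, norm_stat)  # print stats on every batch

# stand-in network and random data, just to exercise the monitor
data = mx.sym.Variable('data')
fc = mx.sym.FullyConnected(data=data, num_hidden=10)
net = mx.sym.SoftmaxOutput(data=fc, name='softmax')
train_iter = mx.io.NDArrayIter(np.random.rand(512, 784).astype(np.float32),
                               np.random.randint(0, 10, 512), batch_size=128)

model = mx.model.FeedForward(symbol=net, ctx=mx.gpu(0), num_epoch=1, learning_rate=0.1)
model.fit(X=train_iter, monitor=mon)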

@jonathanponce
Author

Hey, I used the monitor to check up on things and something is definitely off. When I run the program using my CPU, things look quite normal:


C:\mxnet\nocudnn\python\image-classification>D:\Python27\python.exe train_mnist.py --network lenet
2016-01-09 22:31:09,315 Node[0] start with arguments Namespace(batch_size=128, data_dir='mnist/', gpus=None, kv_store='local', load_epoch=None, lr=0.1, lr_factor=1, lr_factor_epoch=1, model_prefix=None, network='lenet', num_epochs=10, num_examples=60000)
[22:31:11] D:\chhong\mxnet\src\io\iter_mnist.cc:94: MNISTIter: load 60000 images, shuffle=1, shape=(128, 1, 28, 28)
[22:31:11] D:\chhong\mxnet\src\io\iter_mnist.cc:94: MNISTIter: load 10000 images, shuffle=1, shape=(128, 1, 28, 28)
2016-01-09 22:31:11,933 Node[0] Start training with [cpu(0)]
2016-01-09 22:31:13,413 Node[0] Batch:       1 convolution0_output            0.32209
2016-01-09 22:31:13,413 Node[0] Batch:       1 activation0_output             0.263409
2016-01-09 22:31:13,413 Node[0] Batch:       1 pooling0_output                0.264198
2016-01-09 22:31:13,413 Node[0] Batch:       1 convolution1_output            0.280998
2016-01-09 22:31:13,413 Node[0] Batch:       1 activation1_output             0.259359
2016-01-09 22:31:13,413 Node[0] Batch:       1 pooling1_output                0.283388
2016-01-09 22:31:13,413 Node[0] Batch:       1 flatten0_output                0.283388
2016-01-09 22:31:13,413 Node[0] Batch:       1 fullyconnected0_output         0.246848
2016-01-09 22:31:13,413 Node[0] Batch:       1 activation2_output             0.23317
2016-01-09 22:31:13,413 Node[0] Batch:       1 fullyconnected1_output         0.16215
2016-01-09 22:31:13,413 Node[0] Batch:       1 softmax_output                 0.101191
2016-01-09 22:31:13,413 Node[0] Batch:       1 softmax_backward_data          0.301412
2016-01-09 22:31:13,413 Node[0] Batch:       1 softmax_backward_label         0.0
2016-01-09 22:31:13,413 Node[0] Batch:       1 fullyconnected1_backward_data  0.0376285
2016-01-09 22:31:13,413 Node[0] Batch:       1 fullyconnected1_backward_weight 1.13253
2016-01-09 22:31:13,413 Node[0] Batch:       1 fullyconnected1_backward_bias  3.8101
2016-01-09 22:31:13,413 Node[0] Batch:       1 activation2_backward_data      0.0356833
2016-01-09 22:31:13,413 Node[0] Batch:       1 fullyconnected0_backward_data  0.0252012
2016-01-09 22:31:13,413 Node[0] Batch:       1 fullyconnected0_backward_weight 0.163174
2016-01-09 22:31:13,413 Node[0] Batch:       1 fullyconnected0_backward_bias  0.458921
2016-01-09 22:31:13,413 Node[0] Batch:       1 flatten0_backward_data         0.0252012
2016-01-09 22:31:13,413 Node[0] Batch:       1 pooling1_backward_data         0.0126023
2016-01-09 22:31:13,413 Node[0] Batch:       1 activation1_backward_data      0.0116884
2016-01-09 22:31:13,413 Node[0] Batch:       1 convolution1_backward_data     0.010943
2016-01-09 22:31:13,413 Node[0] Batch:       1 convolution1_backward_weight   0.494861
2016-01-09 22:31:13,413 Node[0] Batch:       1 convolution1_backward_bias     1.24864
2016-01-09 22:31:13,413 Node[0] Batch:       1 pooling0_backward_data         0.00705877
2016-01-09 22:31:13,413 Node[0] Batch:       1 activation0_backward_data      0.00671425
2016-01-09 22:31:13,413 Node[0] Batch:       1 convolution0_backward_data     0.0251948
2016-01-09 22:31:13,428 Node[0] Batch:       1 convolution0_backward_weight   0.832047
2016-01-09 22:31:13,428 Node[0] Batch:       1 convolution0_backward_bias     4.85974
2016-01-09 22:31:13,428 Node[0] Batch:       1 data                           0.33463
2016-01-09 22:31:13,428 Node[0] Batch:       1 convolution0_weight            0.175653
2016-01-09 22:31:13,428 Node[0] Batch:       1 convolution0_bias              0.00379667
2016-01-09 22:31:13,428 Node[0] Batch:       1 convolution1_weight            0.0395973
2016-01-09 22:31:13,428 Node[0] Batch:       1 convolution1_bias              0.000975498
2016-01-09 22:31:13,428 Node[0] Batch:       1 fullyconnected0_weight         0.031241
2016-01-09 22:31:13,428 Node[0] Batch:       1 fullyconnected0_bias           0.000358532
2016-01-09 22:31:13,428 Node[0] Batch:       1 fullyconnected1_weight         0.0393582
2016-01-09 22:31:13,428 Node[0] Batch:       1 fullyconnected1_bias           0.00297664
2016-01-09 22:31:13,428 Node[0] Batch:       1 softmax_label                  5.14174

But when I use my GPU, most of the values are zero. Maybe they are being rounded off, or something is wrong with the precision?

C:\mxnet\nocudnn\python\image-classification>D:\Python27\python.exe train_mnist.py --network lenet --gpus 0
2016-01-09 22:31:49,494 Node[0] start with arguments Namespace(batch_size=128, data_dir='mnist/', gpus='0', kv_store='local', load_epoch=None, lr=0.1, lr_factor=1, lr_factor_epoch=1, model_prefix=None, network='lenet', num_epochs=10, num_examples=60000)
[22:31:51] D:\chhong\mxnet\src\io\iter_mnist.cc:94: MNISTIter: load 60000 images, shuffle=1, shape=(128, 1, 28, 28)
[22:31:52] D:\chhong\mxnet\src\io\iter_mnist.cc:94: MNISTIter: load 10000 images, shuffle=1, shape=(128, 1, 28, 28)
2016-01-09 22:31:52,048 Node[0] Start training with [gpu(0)]
2016-01-09 22:31:52,996 Node[0] Batch:       1 convolution0_output            0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 activation0_output             152988.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 pooling0_output                0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 convolution1_output            0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 activation1_output             32342.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 pooling1_output                0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 flatten0_output                0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 fullyconnected0_output         0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 activation2_output             0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 fullyconnected1_output         0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 softmax_output                 0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 softmax_backward_data          0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 softmax_backward_label         0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 fullyconnected1_backward_data  0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 fullyconnected1_backward_weight 0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 fullyconnected1_backward_bias  0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 activation2_backward_data      0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 fullyconnected0_backward_data  0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 fullyconnected0_backward_weight 0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 fullyconnected0_backward_bias  0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 flatten0_backward_data         0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 pooling1_backward_data         0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 activation1_backward_data      0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 convolution1_backward_data     0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 convolution1_backward_weight   0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 convolution1_backward_bias     0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 pooling0_backward_data         0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 activation0_backward_data      0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 convolution0_backward_data     0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 convolution0_backward_weight   0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 convolution0_backward_bias     0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 data                           0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 convolution0_weight            0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 convolution0_bias              0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 convolution1_weight            0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 convolution1_bias              0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 fullyconnected0_weight         39.2047
2016-01-09 22:31:53,013 Node[0] Batch:       1 fullyconnected0_bias           0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 fullyconnected1_weight         0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 fullyconnected1_bias           390.408
2016-01-09 22:31:53,013 Node[0] Batch:       1 softmax_label                  0.0

@piiswrong
Contributor

Could you try some simple arithmetic on the GPU with:

x = mx.nd.zeros((10,10), ctx=mx.gpu(0))
x[:] = 1
x = x*2
print x.asnumpy()
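
(A slightly fuller version of the same probe, as a sketch, for anyone following along; it just compares the GPU result against the expected constant:)

import numpy as np
import mxnet as mx

def gpu_sanity_check(ctx=mx.gpu(0)):
    x = mx.nd.zeros((10, 10), ctx=ctx)
    x[:] = 1
    y = (x * 2).asnumpy()  # asnumpy() forces the pending GPU work to finish
    ok = np.allclose(y, 2.0)
    print("GPU arithmetic: %s" % ("OK" if ok else "BROKEN\n%s" % y))
    return ok

gpu_sanity_check()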

@jonathanponce
Author

It returns an array of zeros; it seems as if the operations are not taking place, or are all returning zero.

>>> print x.asnumpy()
[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]

@piiswrong
Contributor

Could you try running CUDA's sample code for matrix multiplication and see if the results are normal?

@jonathanponce
Author

I ran the sample code and everything seems to be OK:

[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GTX 660M" with compute capability 3.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 4.40 GFlop/s, Time= 29.805 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

The results are as expected, so it seems to be something to do with mxnet.

@piiswrong
Contributor

I can't reproduce the problem locally, so I can't think of anything right now.
You can try git bisect (https://git-scm.com/docs/git-bisect) to see if it's a recently introduced bug.

@jonathanponce
Author

I tried the previous Windows build and it worked without a problem, so that means Windows binary build 20160106 has a bug in the GPU computation path. There have been 29 commits since then, so it's possible it has already been fixed.

@JohanManders

Even if it is just to back up @jonathanponce: I have exactly the same problem. Running train_mnist.py without the --gpus 0 option gives an accuracy of about 0.97, but running with --gpus 0 gives an accuracy of about 0.07.

I use Windows 7 64-bit with Python 2.7 and have tried Windows binary builds 20160120 and 20160113. Both have the same problem for me.

@piiswrong
Contributor

@hjk41 It looks like the GPU code is not running, but also not reporting an error, on Windows with their cards. Could you look into it?

@JohanManders

@piiswrong I watched the GPU load with GPU-Z while running the mxnet code and it is around 25%, so the code is using my GPU.

@Quares

Quares commented Feb 1, 2016

This post reports on the same issue:
https://www.kaggle.com/c/second-annual-data-science-bowl/forums/t/18079/end-to-end-deep-learning-tutorial-0-0392/105458#post105458

I ran into the same situation as well. Not sure yet if the earlier releases solve the problem.

@gpapadop79

Same issue here with mxnet and Python. I installed the latest Windows build, 20160202, and while training a network the accuracy wasn't increasing. The computation was taking place on the GPU, because I checked it with GPU-Z.
I did the simple arithmetic test on the GPU mentioned by @piiswrong and it gave me zeroes.

So I switched to the 20151228 build and now it works OK.

So the bug from 20160106 definitely still exists in 20160202. Hope it helps.

@hjk41
Contributor

hjk41 commented Feb 16, 2016

@piiswrong @Quares @JohanManders @gpapadop79
Sorry it took me so long to respond; I was fully occupied with an internal conference for the last few weeks. I just tried 20160202 and the simple test seems to work fine for me, so I guess it must be something on the system-configuration side. I am using Windows Server 2012 Datacenter with Python 2.7.10 x64. I will try switching to another platform and see if it works there.

Meanwhile, could you help me narrow down the problem a little bit? Here are some things to check:

  1. Run "where libmxnet.dll" and see if you are using the right version of libmxnet.dll.
  2. Run matrixMulCuBLAS from the NVIDIA CUDA samples and see if it works.
  3. Try building mxnet from source and run the test again.

For reference, this is what I get:
Python 2.7.10 (default, May 23 2015, 09:44:00) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
OpenCV is unavailable.
>>> a = mx.nd.ones((2,3), mx.gpu(0))
>>> a.asnumpy()
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]], dtype=float32)
>>> x = mx.nd.zeros((10,10), ctx=mx.gpu(0))
>>> x[:] = 1
>>> x = x*2
>>> print x.asnumpy()
[[ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
 [ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
 [ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
 [ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
 [ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
 [ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
 [ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
 [ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
 [ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
 [ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]]
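
For anyone reproducing this, check 1 and the simple NDArray probe from earlier in the thread can be combined into one short script (a sketch; it assumes libmxnet.dll ships inside the installed mxnet Python package, as in the prebuilt packages):

import os
import mxnet as mx

# which mxnet package (and hence which libmxnet.dll) Python actually loaded
pkg_dir = os.path.dirname(mx.__file__)
print("mxnet package: %s" % pkg_dir)
print("libmxnet.dll present: %s" % os.path.exists(os.path.join(pkg_dir, "libmxnet.dll")))

# the GPU probe; the broken builds print 0.0 everywhere instead of 1.0
a = mx.nd.ones((2, 3), mx.gpu(0))
print(a.asnumpy())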

hjk41 changed the title from "GPU accuracy extremely bad" to "Windows GPU accuracy extremely bad" on Feb 16, 2016
@hjk41
Contributor

hjk41 commented Feb 16, 2016

Just tried on another machine with Windows Server 2012 R2 and Python 2.7.10 x64; it also works fine. :-(
I think I need some help here. It would be great if someone were willing to share a machine that can reproduce the problem.

@piiswrong
Contributor

Looks like it's caused by GPUs with low CUDA compute capability.

@hjk41
Contributor

hjk41 commented Feb 16, 2016

Could be. I am running Titan. Does this also occur for low compute capability GPUs on Linux?

@JohanManders

I have a GTX 670 and when I boot into Ubuntu, mxnet works fine. In Windows I cannot get it to work.

I ran some tests on my Windows 7 64-bit machine using Windows binary build 20160216. An earlier build does the same for me.

  • libmxnet.dll is in the right place
C:\Users\XXXXX>where libmxnet.dll
C:\Anaconda\Lib\site-packages\mxnet-0.5.0-py2.7.egg\mxnet\libmxnet.dll
  • matrixMulCuBLAS passes
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.5\bin\win64\Release>matrixMulCUBLAS.exe
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "GeForce GTX 670" with compute capability 3.0

MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS...done.
Performance= 1059.89 GFlop/s, Time= 0.185 msec, Size= 196608000 Ops
Computing result using host CPU...done.
  • but mxnet gives me 0. 0. 0.
Python 2.7.11 |Anaconda 2.3.0 (64-bit)| (default, Jan 29 2016, 14:26:21) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> import mxnet as mx
>>> a = mx.nd.ones((2,3), mx.gpu(0))
>>> a.asnumpy()
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]], dtype=float32)
>>> x = mx.nd.zeros((10,10), ctx=mx.gpu(0))
>>> x[:] = 1
>>> x = x*2
>>> print x.asnumpy()
[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]
>>>

@hjk41
Contributor

hjk41 commented Feb 16, 2016

@jonathanponce So it is not related to compute capability, since both the GTX 670 and the Titan have compute capability 3.0.
Could you try running a C++ program? You can try this one:
https://github.com/hjk41/MxNet.cpp.git

Check out the test branch and copy libmxnet.lib/libmxnet.dll to lib/windows/, then build the solution in windows/vs/MxNetTestApp/MxNetTestApp.sln for x64. The program just creates an NDArray on the GPU, populates it with ones, and then prints it out. This is pretty much what mx.nd.ones((2,3), mx.gpu(0)) does.

@JohanManders

@hjk41 Did you want me to do the test? If so, I cloned the test branch, copied the dll and lib files (the lib file was also needed) and built the solution successfully. I don't know what should happen or how long it should take, but running the program seems to do nothing.

@hjk41
Contributor

hjk41 commented Feb 16, 2016

@JohanManders The program should output a series of digits from 0 to 5. If it prints nothing, then there must be something wrong. It means the problem also occurs for C++ programs.

@JohanManders

@hjk41 Hmm... strange... Building CUDA samples like marchingCubes, matrixMulCUBLAS and particles is no problem, and they run perfectly.

@gpapadop79

I also ran matrixMulCUBLAS and it passes.

My environment is Windows 7 x64, Python 2.7.11 (Anaconda 2.5.0) and a GTX 960 (which has compute capability 5.2).

@hjk41
Contributor

hjk41 commented Feb 17, 2016

Thanks guys. I think I will have to reinstall one of my machines with Windows 7 to reproduce the problem, which will take some time. Meanwhile, if someone can try to debug the problem, that would be great. With the C++ program, it shouldn't be too hard.

@qggjonny

qggjonny commented Sep 27, 2016

I tested the Windows binary build 20160223; it doesn't have this bug.

@qggjonny

I tested 20160531_win2012_x64_gpu.7z; it doesn't have this bug either.

@yunzhou

yunzhou commented Oct 12, 2016

@qggjonny Can you share your cudnn version, cuda version and GPU hardware?

I tested 20160223_win10_x64_gpu.7z with cuDNN v3, CUDA 8.0 and a GTX 1080.
I tested 20160223_win10_x64_gpu.7z with cuDNN v5.1, CUDA 8.0 and a GTX 1080.
Still not working.

In [1]: import mxnet as mx
...: a=mx.nd.ones((2, 3), mx.gpu())
...: print(a.asnumpy())
...:
...: b=mx.nd.ones((2, 3), mx.cpu())
...: print(b.asnumpy())
...:
[[ 0. 0. 0.]
[ 0. 0. 0.]]
[[ 1. 1. 1.]
[ 1. 1. 1.]]

@qggjonny

qggjonny commented Oct 12, 2016

My CUDA version is 7.5 and my GPU is a GT730.
Perhaps you can try 20160531_win2012_x64_gpu.7z; it can run on Win10.

@auroralinan

@qggjonny When I installed the win2012 version and called import mxnet in the Python console, it told me cudnn64_70.dll is missing, but I have put my cuDNN in the correct folder, and cuDNN v5 and v5.1 are both named cudnn64_5.dll. Can you look at the 3rdparty folder and see what your cuDNN version is?

@qggjonny

I put cudnn64_70.dll in mxnet\3rdparty\cudnn and mxnet\3rdparty\cudnn\bin.

@MaticsL

MaticsL commented Oct 13, 2016

I have the same problem with 20160531_win10_x64_gpu and also with 20160531_win2012_x64_gpu:

In [3]: (mxnet.nd.ones((2,2), mxnet.cpu())*100).asnumpy()
Out[3]:
array([[ 100.,  100.],
       [ 100.,  100.]], dtype=float32)

In [4]: (mxnet.nd.ones((2,2), mxnet.gpu())*100).asnumpy()
Out[4]:
array([[ 0.,  0.],
       [ 0.,  0.]], dtype=float32)

@jf003320018

I went back to 20160223_win10_x64_gpu.7z, and the GPU works there. But with the newest version, it does not work...

@yunzhou

yunzhou commented Oct 15, 2016

@jf003320018 I also tried 20160223_win10_x64_gpu.7z, but it does not work. I use CUDA 8.0 with cuDNN 3.
What are your CUDA and cuDNN versions?

@yunzhou

yunzhou commented Oct 15, 2016

@auroralinan Maybe you can try cuDNN v3; it has cudnn64_70.dll.

@MaticsL

MaticsL commented Oct 15, 2016

@yunzhou I have the same environment as you, and mxnet does not work either.

@yunzhou

yunzhou commented Oct 15, 2016

@MaticsL I tried 20160223_win10_x64_gpu.7z + CUDA 7.5 + cuDNN 3; the GPU ones function returns 0. My hardware is a GTX 1080.
I also noticed that even if I put nothing under 3rdparty\cudnn in 20160223_win10_x64_gpu, it still runs, but the GPU ones function still returns 0.
I will try CUDA 7.0 later.

@jf003320018

@yunzhou My CUDA is 8.0 and my cuDNN is v3. Just follow the readme for 20160223_win10_x64_gpu.7z and it will work.

@yunzhou

yunzhou commented Oct 18, 2016

@jf003320018 Thanks. Since CUDA 7.0 cannot recognize the GTX 1080, I will go back to CUDA 8.0 and try 20160223_win10_x64_gpu.7z.

@nsndimt

nsndimt commented Oct 30, 2016

I used to compile mxnet with CUDA 8.0 RC, cuDNN 5.1, OpenCV 3.0 and MKL on Windows 10 and had this problem too.
Now I have changed CUDA 8.0 RC to CUDA 8.0 and see no problem.
P.S. I use a GTX 1060.

@sanson87

sanson87 commented Nov 1, 2016

Using the R package, I have this exact problem with:
Windows 10
CUDA 8.0
cuDNN v3
GTX 1060
20160531_win10_x64_gpu as well as 20160223_win10_x64_gpu
Does anyone have a reliable solution?

@MaticsL

MaticsL commented Nov 1, 2016

@sanson87 You can use the prebuilt version from #2813.

@sanson87

sanson87 commented Nov 1, 2016

@MaticsL Thanks. Do you know how I can build my GPU R package from those? The 3rdparty folder seems to be missing from 20161101_mxnet_x64_gpu, for instance. Sorry if it's a dumb question.

@No41Name

No41Name commented Nov 4, 2016

I have the same problem on Windows 10 using:
CUDA 8.0
cuDNN v3
GeForce 840M
I have tried both 20160531_win10_x64_gpu and 20161104_win10_x64_gpu,
but running on the GPU the train accuracy always stays fixed.
Which version can I try to solve this problem?
Thanks.

@No41Name

No41Name commented Nov 4, 2016

It seems like this problem can't be solved...

@cemkeskin

Actually, compiling from the latest source code with the new CUDA 8.0.44, cuDNN 5.1 and VS2015 finally worked for me. The issue seems to be solved, most likely thanks to the new CUDA release. With CUDA 8.0.27, only the Debug build was working correctly.

@No41Name

No41Name commented Nov 6, 2016

Thanks @cemkeskin, but the problem still persists.
I'm using the same versions you mentioned (CUDA 8.0.44, cuDNN 5.1, VS2015), but using the R-package folder from the GitHub repository (precisely, this: https://github.com/dmlc/mxnet/) and combining it with the Windows binary build 20160531 (see the first attachment), I get this error when trying to install (see the second attachment). I tried editing the "NAMESPACE" file to remove the names that can't be found, and the installation then seems to work, but when I run the models in R on the GPU the train accuracy remains fixed after the second epoch.
Any suggestions? What do you mean by "compiling from the source code"?

[Attached screenshots: "immagine", "error-binary-build-may-nosourcecode"]

@No41Name

No41Name commented Nov 9, 2016

I don't want to be annoying, but can someone help me? I really need a solution.

@hjk41
Contributor

hjk41 commented Nov 10, 2016

To build MXNet from source, please follow the instructions here:
http://mxnet.io/get_started/setup.html#build-mxnet-on-windows

The prebuilt binaries sometimes have strange problems with different OS/CUDA configurations. Building from source usually solves the problem.


@LinGuanfu

LinGuanfu commented Dec 7, 2016

When I use 20160531_win10_x64_gpu.7z, I have the same problem on:
Windows 8.1, CUDA 8.0, cuDNN v3 or v5, VS 2015, GTX 860M.
But I did solve this problem by using the builds from https://github.com/yajiedesign/mxnet/releases, with these steps:

  1. Uninstall mxnet from Python (if you already installed it).
  2. Download 20161125_mxnet_x64_gpu.7z, unpack it, and copy all files into the path where you previously unpacked 20160531_win10_x64_gpu.7z, replacing the files there.
  3. Download cuDNN v4 and unpack it into .../3rdparty/cudnn/.
  4. Run setupenv.cmd or set the environment path.
  5. cd to .../python and run python setup.py install.

One thing to note is that it requires cudnn64_4.dll, which belongs to cuDNN v4.

@noahzn

noahzn commented Dec 10, 2016

I met the same problem when using Python on Ubuntu 16.04 LTS, with CUDA 8.0.44 and cuDNN 5.1. My GPU is a Tesla K40c.
This also happens when I compute on the CPU, but it doesn't happen all the time.
@piiswrong
It seems that lenet.py causes this low accuracy, in both CPU and GPU modes.
If I use mlp, it works well.

@howard0su
Contributor

I am facing a similar issue which is even more interesting: I run the same binary twice and get totally different results.

D:\mxnet\example\image-classification>python train_mnist.py --gpus 0 --network lenet
INFO:root:start with arguments Namespace(batch_size=64, disp_batches=100, gpus='0', kv_store='device', load_epoch=None, lr=0.1, lr_factor=0.1, lr_step_epochs='10', model_prefix=None, mom=0.9, network='lenet', num_classes=10, num_epochs=20, num_examples=60000, num_layers=None, optimizer='sgd', test_io=0, top_k=0, wd=0.0001)
WARNING:root:[Deprecation Warning] mxnet.model.FeedForward has been deprecated. Please use mxnet.mod.Module instead.
INFO:root:Start training with [gpu(0)]
INFO:root:Epoch[0] Batch [100] Speed: 1418.34 samples/sec Train-accuracy=0.116719
INFO:root:Epoch[0] Batch [200] Speed: 1402.49 samples/sec Train-accuracy=0.101875
INFO:root:Epoch[0] Batch [300] Speed: 1404.02 samples/sec Train-accuracy=0.092188
INFO:root:Epoch[0] Batch [400] Speed: 1404.78 samples/sec Train-accuracy=0.097656
INFO:root:Epoch[0] Batch [500] Speed: 1400.95 samples/sec Train-accuracy=0.107188

I run it again, without touching any file or rebooting, even in the same cmd prompt:
D:\mxnet\example\image-classification>python train_mnist.py --gpus 0 --network lenet
INFO:root:start with arguments Namespace(batch_size=64, disp_batches=100, gpus='0', kv_store='device', load_epoch=None, lr=0.1, lr_factor=0.1, lr_step_epochs='10', model_prefix=None, mom=0.9, network='lenet', num_classes=10, num_epochs=20, num_examples=60000, num_layers=None, optimizer='sgd', test_io=0, top_k=0, wd=0.0001)
WARNING:root:[Deprecation Warning] mxnet.model.FeedForward has been deprecated. Please use mxnet.mod.Module instead.
INFO:root:Start training with [gpu(0)]
INFO:root:Epoch[0] Batch [100] Speed: 1418.85 samples/sec Train-accuracy=0.823750
INFO:root:Epoch[0] Batch [200] Speed: 1407.11 samples/sec Train-accuracy=0.915937
INFO:root:Epoch[0] Batch [300] Speed: 1405.28 samples/sec Train-accuracy=0.939063
INFO:root:Epoch[0] Batch [400] Speed: 1405.67 samples/sec Train-accuracy=0.948281
INFO:root:Epoch[0] Batch [500] Speed: 1390.74 samples/sec Train-accuracy=0.953594
INFO:root:Epoch[0] Batch [600] Speed: 1344.73 samples/sec Train-accuracy=0.957031
INFO:root:Epoch[0] Batch [700] Speed: 1345.86 samples/sec Train-accuracy=0.955781
INFO:root:Epoch[0] Batch [800] Speed: 1380.60 samples/sec Train-accuracy=0.960781
INFO:root:Epoch[0] Batch [900] Speed: 1350.45 samples/sec Train-accuracy=0.962344

I bet this is due to some error related to the initialization of the weights. If I get another repro, I will try to debug it.
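
If it really is an initialization issue, one way to narrow it down (a sketch using the Module API that the deprecation warning points to; lenet is a placeholder for the symbol that train_mnist.py --network lenet builds, and the shapes are illustrative) is to fix the seed, dump the initial parameter norms before any training, and compare two consecutive runs:

import numpy as np
import mxnet as mx

mx.random.seed(42)  # fix the RNG so two runs should start from identical weights

# lenet stands in for the symbol built by train_mnist.py --network lenet
mod = mx.mod.Module(symbol=lenet, context=mx.gpu(0))
mod.bind(data_shapes=[('data', (64, 1, 28, 28))],
         label_shapes=[('softmax_label', (64,))])
mod.init_params(initializer=mx.init.Xavier())

arg_params, _ = mod.get_params()
for name, arr in sorted(arg_params.items()):
    # all-zero norms here would point at initialization rather than the update step
    print("%-30s %.6f" % (name, float(np.linalg.norm(arr.asnumpy()))))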

@yajiedesign
Contributor

This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks!
