Windows GPU accuracy extremely bad #1228
Here is my output from train_mnist.py:
It looks fine. Did you try pulling the newest change and running make clean && make?
here is mine:
As you can see, the accuracy stays around 9% and remains the same even after the 10 epochs. As for the make part, I downloaded and installed the pre-built GPU package from here.
My output with exactly the same command on linux:
This seems to be a Windows-specific issue. @hjk41 Could you look into it? Meanwhile, @jonathanponce, try using monitor (example in example/python-howto/monitor_weights.py) to check the internal weights and outputs to see if anything is wrong.
Hey, I used the monitor to check up on things and something is definitely happening. When I run the program using my CPU, things look quite normal, but when I use my GPU, most of the weights are zero. Maybe they are being rounded off, or something is wrong with the precision?
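The symptom above (weights stuck at zero on GPU) can be checked mechanically once the arrays are copied back to the host with asnumpy(). A minimal sketch of that check in plain NumPy — `suspicious_arrays` is a hypothetical helper for illustration, not part of mxnet's monitor API:

```python
import numpy as np

def suspicious_arrays(named_arrays, tol=0.0):
    """Return names of arrays whose entries are all (near) zero.

    After even a few training batches, healthy weights should be
    non-zero; all-zero weights suggest the backend is not computing.
    """
    return [name for name, arr in named_arrays.items()
            if np.all(np.abs(arr) <= tol)]

# Illustrative data: a "healthy" weight matrix and an all-zero one.
weights = {
    "fc1_weight": np.random.randn(128, 784) * 0.01,
    "fc2_weight": np.zeros((10, 128)),
}
print(suspicious_arrays(weights))  # -> ['fc2_weight']
```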
Could you try to do some simple arithmetic on the GPU?

```python
x = mx.nd.zeros((10,10), ctx=mx.gpu(0))
x[:] = 1
x = x*2
print x.asnumpy()
```
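For reference, a healthy backend should print a 10×10 array of all 2.0. The expected result, sketched with plain NumPy (no GPU involved):

```python
import numpy as np

# NumPy equivalent of the mx.nd snippet above: on a working backend,
# the mx.nd version gives the same result after asnumpy().
x = np.zeros((10, 10))
x[:] = 1
x = x * 2
print(x[0])  # -> [2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
```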
It returns an array of zeros. It seems as if the operations are not taking place, or they are all returning zero.
Could you try to run CUDA's sample code for matrix multiply and see if the results are correct?
I ran the sample code and everything seems to be OK. The results are as expected, so it seems to be something to do with mxnet.
I can't reproduce the problem locally, so I can't think of anything now.
I tried out the previous Windows build and it worked without a problem, so that means Windows binary build 20160106 has a bug in the GPU computation section. There have been 29 commits since then, so it's possible that it has been fixed already.
Even if it is just to back @jonathanponce: I have exactly the same problem. Running train_mnist.py without the --gpus 0 flag gives an accuracy of about 0.97, but running with --gpus 0 gives an accuracy of about 0.07. I use Windows 7 64-bit with Python 2.7 and have tried Windows binary builds 20160120 and 20160113. Both have the same problem for me.
@hjk41 Looks like the GPU code is not running, yet no error is reported, on Windows with their cards. Could you look into it?
@piiswrong I watched the GPU load with GPU-Z when running the mxnet code, and the load is around 25%, so the code is using my GPU.
This post reports on the same issue: I ran into the same situation as well. Not sure yet if the earlier releases solve the problem. |
Same issue here with mxnet and Python. I installed the latest Windows build 20160202, and while training a network the accuracy wasn't increasing. The computation was taking place on the GPU, because I checked it with GPU-Z. So I switched to the 20151228 build and now it works OK. So the bug from 20160106 definitely still exists in 20160202. Hope it helps.
@piiswrong @Quares @JohanManders @gpapadop79 Meanwhile, could you help me narrow down the problem a little bit? Here are some speculations:
Just tried on another machine with Windows Server 2012R2, Python 2.7.10 x64, it also works fine. :-( |
Looks like it's caused by low cuda compute capability GPUs. |
Could be. I am running Titan. Does this also occur for low compute capability GPUs on Linux? |
I have a GTX 670, and when I boot into Ubuntu, mxnet works fine. In Windows I cannot get it to work. I ran some tests on my Windows 7 64-bit, using Windows binary build 20160216. Using an earlier build does the same for me.
@jonathanponce So it is not related to compute capability, since both the GTX 670 and the Titan have compute capability 3.0. Check out the test branch, copy libmxnet.lib/libmxnet.dll to lib/windows/, then build the solution in windows/vs/MxNetTestApp/MxNetTestApp.sln with x64. The program just creates an NDArray on the GPU, populates it with ones, and then prints it out. This is pretty much what mx.nd.ones((2,3), mx.gpu(0)) does.
@hjk41 Did you want me to do the test? If so, I cloned the test branch, copied the dll and lib file (the lib file was also needed), and built the solution successfully. I don't know what should happen or how long it should take, but running the program seems to do nothing.
@JohanManders The program should output a series of digits from 0 to 5. If it prints nothing, then there must be something wrong. It means the problem also occurs for C++ programs.
@hjk41 Mmm... Strange... Building CUDA samples like marchingCubes, matrixMulCUBLAS and particles is no problem, and they run perfectly.
I also ran matrixMulCUBLAS and it passes. My environment is Windows 7 x64 python 2.7.11 (Anaconda 2.5.0) and GTX 960 (which has compute capability 5.2) |
Thanks guys. I think I will have to reinstall one of my machines to use Windows 7 to reproduce the problem, which will need some time. Meanwhile, if someone can try to debug the problem, it would be great. With the C++ program, it shouldn't be too hard. |
I tested the Windows binary build 20160223; it doesn't have this bug.
I tested 20160531_win2012_x64_gpu.7z; it doesn't have this bug either.
@qggjonny Can you share your cudnn version, cuda version and GPU hardware? I tested 20160223_win10_x64_gpu.7z on cudnn v3, cuda 8.0 and GTX1080. In [1]: import mxnet as mx |
My CUDA version is 7.5, and my GPU is a GT 730.
@qggjonny When I installed win2012 version and call |
I put cudnn64_70.dll in mxnet\3rdparty\cudnn and mxnet\3rdparty\cudnn\bin. |
I have the same problem with
|
I returned to 20160223_win10_x64_gpu.7z, and the GPU also works. But in the newest version, it does not work...
@jf003320018 I also tried 20160223_win10_x64_gpu.7z, but it does not work. I use CUDA 8.0 with cuDNN 3.
@auroralinan Maybe you can try cuDNN v3; it has cudnn64_70.dll.
@yunzhou I have the same environment with you and mxnet does not work either. |
@MaticsL I tried 20160223_win10_x64_gpu.7z + CUDA 7.5 + cuDNN 3. The GPU ones function returns 0.
@yunzhou My CUDA is 8.0 and cuDNN is v3. Just follow the readme for 20160223_win10_x64_gpu.7z and it will work.
@jf003320018 Thanks. Since CUDA 7.0 cannot recognize the GTX 1080, I will return to CUDA 8.0 and try 20160223_win10_x64_gpu.7z.
I compiled mxnet with CUDA 8.0 RC, cuDNN 5.1, OpenCV 3.0 and MKL on Windows 10 and had this problem too.
Using the R package, I have this exact problem with:
@MaticsL Thanks, do you know how I can build my GPU R package from those? The folder 3rdparty seems to be missing from 20161101_mxnet_x64_gpu, for instance. Sorry if it's a dumb question.
I have the same problem on windows 10 using |
It seems like this problem can't be solved... |
Actually, compiling from the latest source code with the new CUDA 8.0.44, cuDNN 5.1 and VS2015 finally worked for me. The issue seems to be solved, most likely due to the new CUDA release. With CUDA 8.0.27, only the Debug build was working correctly.
Thanks cemkeskin, but the problem still persists.
I don't want to be annoying, but can someone help me? I really need a solution |
To build MXNet from source, please follow the instructions here: The prebuilt binaries sometimes have strange problems with different OS/CUDA combinations.
When I use the
There is one thing to note that it requires |
I met the same problem when using Python on Ubuntu 16.04 LTS, with CUDA 8.0.44 and cuDNN 5.1. My GPU is a Tesla K40c.
I am facing a similar issue which is even more interesting: the same binary, run twice, gives totally different results. D:\mxnet\example\image-classification>python train_mnist.py --gpus 0 --network lenet I run again, without touching any file or rebooting, even in the same cmd prompt: I bet this is due to some error related to initialization of weights. If I get another repro, I will try to debug.
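Run-to-run nondeterminism of the kind described above often points at unseeded or uninitialized weight buffers. A minimal NumPy sketch of the check one could run — `init_weights` is a hypothetical helper for illustration, not mxnet's initializer API — two runs with the same seed must produce identical weights, whereas uninitialized memory would not:

```python
import numpy as np

def init_weights(shape, seed):
    """Hypothetical weight init: deterministic given a fixed seed."""
    rng = np.random.RandomState(seed)
    return rng.randn(*shape) * 0.01

# Same seed -> identical weights; a correct init is reproducible.
w1 = init_weights((10, 10), seed=42)
w2 = init_weights((10, 10), seed=42)
print(np.array_equal(w1, w2))  # -> True
```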
This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks! |
Hey, I'm quite new to mxnet. I followed the installation instructions and succeeded in installing it on Windows 8.1 64-bit. I then ran train_mnist.py --network lenet without a problem; it was quite slow, but the accuracy at the end is good, at around 99.2%. But when I run it as --network lenet --gpus 0 to use my GPU, it is definitely a lot faster, but the accuracy never gets above 10%, which is terrible. There must be something wrong; theoretically it should be the same accuracy, right? I installed CUDA 7.5 and also extracted cuDNN v3 just as indicated. Everything runs without a problem, except the accuracy is terrible. I'm running on a laptop with an NVIDIA 660M graphics card, which has compute capability 3.0.
After running the file I get Train-accuracy=0.098825