
GPU does not train for kaggle_bowl example #28

Closed
hothHowler opened this issue Jan 6, 2015 · 1 comment

@hothHowler

I've been able to run the kaggle_bowl example, but the GPU runs report a training error that stays flat. If I run the same bowl.conf with dev=cpu instead, training proceeds fine (I've appended the output of both runs below). I haven't otherwise changed bowl.conf. I did run the MNIST example on the GPU and it worked fine.

Do you have an idea of what might be going on here? Thanks!
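For reference, the only setting that differs between the two runs below is the device line in bowl.conf; schematically (key name taken from the dev=cpu setting mentioned above, everything else left as shipped):

dev = gpu
dev = cpu

The first gives the flat train-error shown in the GPU run; the second is the CPU run that trains correctly but far more slowly.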


GPU run (dev=gpu):
Use CUDA Device 0: Tesla C1060
CXXNetTrainer, devCPU=0
ConvolutionLayer: nstep=256
ConvolutionLayer: nstep=256
ConvolutionLayer: nstep=128
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
node[0].shape: 256,3,48,48
node[1].shape: 256,96,12,12
node[2].shape: 256,96,12,12
node[3].shape: 256,96,6,6
node[4].shape: 256,128,8,8
node[5].shape: 256,128,8,8
node[6].shape: 256,128,8,8
node[7].shape: 256,128,8,8
node[8].shape: 256,128,4,4
node[9].shape: 1,1,256,2048
node[10].shape: 1,1,256,512
node[11].shape: 1,1,256,512
node[12].shape: 1,1,256,512
node[13].shape: 1,1,256,512
node[14].shape: 1,1,256,121
ThreadImagePageIterator:image_list=./train.lst, bin=./train.bin
loading mean image from models/image_mean.bin
ThreadBufferIterator: buffer_size=2
ThreadImagePageIterator:image_list=./train.lst, bin=./train.bin
loading mean image from models/image_mean.bin
ThreadBufferIterator: buffer_size=2
initializing end, start working
round 0:[ 100] 13 sec elapsed[1] train-error:0.999570
round 1:[ 100] 33 sec elapsed[2] train-error:0.999570
round 2:[ 100] 52 sec elapsed[3] train-error:0.999570
round 3:[ 100] 72 sec elapsed[4] train-error:0.999570


CPU run (dev=cpu):
CXXNetTrainer, devCPU=1
ConvolutionLayer: nstep=256
ConvolutionLayer: nstep=256
ConvolutionLayer: nstep=128
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
node[0].shape: 256,3,48,48
node[1].shape: 256,96,12,12
node[2].shape: 256,96,12,12
node[3].shape: 256,96,6,6
node[4].shape: 256,128,8,8
node[5].shape: 256,128,8,8
node[6].shape: 256,128,8,8
node[7].shape: 256,128,8,8
node[8].shape: 256,128,4,4
node[9].shape: 1,1,256,2048
node[10].shape: 1,1,256,512
node[11].shape: 1,1,256,512
node[12].shape: 1,1,256,512
node[13].shape: 1,1,256,512
node[14].shape: 1,1,256,121
ThreadImagePageIterator:image_list=./train.lst, bin=./train.bin
loading mean image from models/image_mean.bin
ThreadBufferIterator: buffer_size=2
ThreadImagePageIterator:image_list=./train.lst, bin=./train.bin
loading mean image from models/image_mean.bin
ThreadBufferIterator: buffer_size=2
initializing end, start working
round 0:[ 100] 311 sec elapsed[1] train-error:0.776947
round 1:[ 100] 815 sec elapsed[2] train-error:0.689883
update round 2

@StevenHickson

Saw this on the Kaggle forums and it fixed it for me:

"I discovered a warning during the compilation of 'tensor_gpu-inl.cuh' which is part of mshadow and located in 'cxxnet-master/mshadow/mshadow/cuda'. The warning said that the CUDA architecture of the GPU could not be determined and will be set to 2.0 automatically. It seemed that the CUDA_ARCH macro, that is checked for the architecture, was broken. Setting this macro by hand like

#define CUDA_ARCH 300

resolved the issue for me (in my case, the 300 comes from the GTX 680 having compute capability 3.0). Maybe this can help with your problem too..."

Except mine is a GTX 760, so I put #define CUDA_ARCH 500. Then it gave me a redefinition error, and everything worked again. So just run make clean, add this line in tensor_gpu-inl.cuh, and then make again.
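If you are unsure which value matches your card, one way to check is to query the compute capability directly from the CUDA runtime. This is just a standalone sketch (not part of cxxnet), built with nvcc:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query device 0 -- the device cxxnet reports at startup ("Use CUDA Device 0").
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::fprintf(stderr, "failed to query device 0\n");
        return 1;
    }
    // major.minor is the compute capability; e.g. 3.0 maps to a define of 300.
    std::printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}

The printed major.minor maps straight to the value of the define (3.0 gives 300, 3.5 gives 350, and so on).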
