
GPU does not train for kaggle_bowl example #28

Closed
hothHowler opened this issue Jan 6, 2015 · 1 comment

@hothHowler

I've been able to run the kaggle_bowl example, but the GPU runs report a training error that stays flat. If I run the same bowl.conf with dev=cpu instead, training proceeds fine (I've appended the output of both runs below). I haven't otherwise changed bowl.conf. I did run the MNIST example on the GPU and it worked fine.

Do you have an idea of what might be going on here? Thanks!
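For reference, the only setting that differs between the two runs below is the device line in bowl.conf; schematically (key name taken from the dev=cpu setting mentioned above, everything else left as shipped):

dev = gpu
dev = cpu

The first gives the flat train-error shown in the GPU run; the second is the CPU run that trains correctly but far more slowly.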


GPU run (dev=gpu):
Use CUDA Device 0: Tesla C1060
CXXNetTrainer, devCPU=0
ConvolutionLayer: nstep=256
ConvolutionLayer: nstep=256
ConvolutionLayer: nstep=128
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
node[0].shape: 256,3,48,48
node[1].shape: 256,96,12,12
node[2].shape: 256,96,12,12
node[3].shape: 256,96,6,6
node[4].shape: 256,128,8,8
node[5].shape: 256,128,8,8
node[6].shape: 256,128,8,8
node[7].shape: 256,128,8,8
node[8].shape: 256,128,4,4
node[9].shape: 1,1,256,2048
node[10].shape: 1,1,256,512
node[11].shape: 1,1,256,512
node[12].shape: 1,1,256,512
node[13].shape: 1,1,256,512
node[14].shape: 1,1,256,121
ThreadImagePageIterator:image_list=./train.lst, bin=./train.bin
loading mean image from models/image_mean.bin
ThreadBufferIterator: buffer_size=2
ThreadImagePageIterator:image_list=./train.lst, bin=./train.bin
loading mean image from models/image_mean.bin
ThreadBufferIterator: buffer_size=2
initializing end, start working
round 0:[ 100] 13 sec elapsed[1] train-error:0.999570
round 1:[ 100] 33 sec elapsed[2] train-error:0.999570
round 2:[ 100] 52 sec elapsed[3] train-error:0.999570
round 3:[ 100] 72 sec elapsed[4] train-error:0.999570


CPU run (dev=cpu):
CXXNetTrainer, devCPU=1
ConvolutionLayer: nstep=256
ConvolutionLayer: nstep=256
ConvolutionLayer: nstep=128
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
SGDUpdater: eta=0.010000, mom=0.900000
SGDUpdater: eta=0.020000, mom=0.900000
node[0].shape: 256,3,48,48
node[1].shape: 256,96,12,12
node[2].shape: 256,96,12,12
node[3].shape: 256,96,6,6
node[4].shape: 256,128,8,8
node[5].shape: 256,128,8,8
node[6].shape: 256,128,8,8
node[7].shape: 256,128,8,8
node[8].shape: 256,128,4,4
node[9].shape: 1,1,256,2048
node[10].shape: 1,1,256,512
node[11].shape: 1,1,256,512
node[12].shape: 1,1,256,512
node[13].shape: 1,1,256,512
node[14].shape: 1,1,256,121
ThreadImagePageIterator:image_list=./train.lst, bin=./train.bin
loading mean image from models/image_mean.bin
ThreadBufferIterator: buffer_size=2
ThreadImagePageIterator:image_list=./train.lst, bin=./train.bin
loading mean image from models/image_mean.bin
ThreadBufferIterator: buffer_size=2
initializing end, start working
round 0:[ 100] 311 sec elapsed[1] train-error:0.776947
round 1:[ 100] 815 sec elapsed[2] train-error:0.689883
update round 2

@StevenHickson

Saw this on the Kaggle forums and it fixed it for me:

"I discovered a warning during the compilation of 'tensor_gpu-inl.cuh' which is part of mshadow and located in 'cxxnet-master/mshadow/mshadow/cuda'. The warning said that the CUDA architecture of the GPU could not be determined and will be set to 2.0 automatically. It seemed that the CUDA_ARCH macro, that is checked for the architecture, was broken. Setting this macro by hand like

#define CUDA_ARCH 300

resolved the issue for me (in my case, the 300 comes from the GTX 680 having compute capability 3.0). Maybe this can help with your problem too..."

Except mine is a GTX 760, so I put #define CUDA_ARCH 500. Then it gave me a redefinition error, and everything worked again. So just run make clean, add this line in tensor_gpu-inl.cuh, and then make again.
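If you are unsure which value matches your card, one way to check is to query the compute capability directly from the CUDA runtime. This is just a standalone sketch (not part of cxxnet), built with nvcc:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query device 0 -- the device cxxnet reports at startup ("Use CUDA Device 0").
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::fprintf(stderr, "failed to query device 0\n");
        return 1;
    }
    // major.minor is the compute capability; e.g. 3.0 maps to a define of 300.
    std::printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}

The printed major.minor maps straight to the value of the define (3.0 gives 300, 3.5 gives 350, and so on).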
