Convolution error #15

RutenburgIG · 2017-10-19T12:23:05Z

Caffe compiled with:

cmake .. -DPROTOBUF_INCLUDE_DIR="/beegfs/120x/home/ilia/protobuf/include/" -DUSE_NCCL=True -DCUDA_ARCH_NAME=Manual -DCUDA_ARCH_BIN="30 35 50 52 60 61 62 70" -DCUDA_ARCH_PTX="30 35 50 52 60 61 62 70" -DCUDA_NVCC_FLAGS=--Wno-deprecated-gpu-targets -Wno-dev

-- Boost version: 1.54.0
-- Found the following Boost libraries:
--   system
--   thread
--   filesystem
-- Found gflags  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libgflags.so)
-- Found glog    (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libglog.so)
-- Found PROTOBUF Compiler: /beegfs/120x/home/ilia/protobuf/bin/protoc
-- Found lmdb    (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/liblmdb.so)
-- Found LevelDB (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libleveldb.so)
-- Found Snappy  (include: /usr/include, library: /usr/lib/libsnappy.so)
-- Found JPEGTurbo: /usr/include
-- CUDA detected: 9.0
-- Found CUDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so (found version "7.0")
-- Added CUDA NVCC flags for: sm_30 sm_35 sm_50 sm_52 sm_60 sm_61 sm_62 sm_70 compute_30 compute_35 compute_50 compute_52 compute_60 compute_61 compute_62 compute_70
-- Found OpenCV 2.x: /usr/share/OpenCV
-- Found Atlas: /usr/include
-- Found Atlas (include: /usr/include, library: /usr/lib/libatlas.so)
-- Found PythonInterp: /beegfs/120x/home/ilia/nvcaffe_comp/bin/python2.7 (found suitable version "2.7.6", minimum required is "2.7")
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython2.7.so (found suitable version "2.7.6", minimum required is "2.7")
-- Found NumPy: /beegfs/120x/home/ilia/nvcaffe_comp/local/lib/python2.7/site-packages/numpy/core/include (found suitable version "1.13.1", minimum required is "1.7.1")
-- NumPy ver. 1.13.1 found (include: /beegfs/120x/home/ilia/nvcaffe_comp/local/lib/python2.7/site-packages/numpy/core/include)
-- Boost version: 1.54.0
-- Found the following Boost libraries:
--   python
-- Could NOT find Doxygen (missing:  DOXYGEN_EXECUTABLE)
-- Found NCCL: /usr/include
-- Found NCCL (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libnccl.so)
-- Found NVML: /usr/include
-- Found NVML (include: /usr/include, library: /usr/lib/nvidia-384/libnvidia-ml.so)
-- Found Git: /usr/bin/git (found version "1.9.1")
--
-- ******************* Caffe Configuration Summary *******************
-- General:
--   Version           :   0.16.4
--   Git               :   v0.16.1-404-g860701c
--   System            :   Linux
--   C++ compiler      :   /usr/bin/c++
--   Release CXX flags :   -O3 -DNDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized
--   Debug CXX flags   :   -g -DDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized
--   Build type        :   Release
--
--   BUILD_SHARED_LIBS :   ON
--   BUILD_python      :   ON
--   BUILD_matlab      :   OFF
--   BUILD_docs        :   ON
--   CPU_ONLY          :   OFF
--   USE_LEVELDB       :   ON
--   USE_LMDB          :   ON
--   ALLOW_LMDB_NOLOCK :   OFF
--   TEST_FP16         :   OFF
--
-- Dependencies:
--   BLAS              :   Yes (Atlas)
--   Boost             :   Yes (ver. 1.54)
--   glog              :   Yes
--   gflags            :   Yes
--   protobuf          :   Yes (ver. 3.4.0)
--   lmdb              :   Yes (ver. 0.9.10)
--   LevelDB           :   Yes (ver. 1.15)
--   Snappy            :   Yes (ver. 1.1.0)
--   OpenCV            :   Yes (ver. 2.4.8)
--   JPEGTurbo         :   No
--   CUDA              :   Yes (ver. 9.0)
--
-- NVIDIA CUDA:
--   Target GPU(s)     :   Manual
--   GPU arch(s)       :   sm_30 sm_35 sm_50 sm_52 sm_60 sm_61 sm_62 sm_70 compute_30 compute_35 compute_50 compute_52 compute_60 compute_61 compute_62 compute_70
--   cuDNN             :   Yes (ver. 7.0)
--   NCCL              :   Yes (ver. 2.0.5)
--   NVML              :   /usr/lib/nvidia-384/libnvidia-ml.so
--
-- Python:
--   Interpreter       :   /beegfs/120x/home/ilia/nvcaffe_comp/bin/python2.7 (ver. 2.7.6)
--   Libraries         :   /usr/lib/x86_64-linux-gnu/libpython2.7.so (ver 2.7.6)
--   NumPy             :   /beegfs/120x/home/ilia/nvcaffe_comp/local/lib/python2.7/site-packages/numpy/core/include (ver 1.13.1)
--
-- Documentaion:
--   Doxygen           :   No
--   config_file       :
--
-- Install:
--   Install path      :   /beegfs/120x/home/ilia/caffe_builds/nvc/build/install
--
-- Configuring done
-- Generating done
-- Build files have been written to: /beegfs/120x/home/ilia/caffe_builds/nvc/build

When I tried to start training with:
./build/tools/caffe train -solver='solver.prototxt'

I got the following error:

I1019 15:17:12.441572 108568 solver.cpp:315] Iteration 0 (0.371277 s), loss = 1383.36
I1019 15:17:12.441620 108568 solver.cpp:332]     Train net output #0: loss_bbox = 8.39254e-06 (* 100 = 0.000839254 loss)
I1019 15:17:12.441634 108568 solver.cpp:332]     Train net output #1: loss_cls = 2.47564 (* 500 = 1237.82 loss)
I1019 15:17:12.441706 108568 solver.cpp:332]     Train net output #2: rpn_cls_loss = 0.693479 (* 100 = 69.3479 loss)
I1019 15:17:12.441738 108568 solver.cpp:332]     Train net output #3: rpn_loss_bbox = 0.93409 (* 100 = 93.409 loss)
I1019 15:17:12.441750 108568 sgd_solver.cpp:136] Iteration 0, lr = 5e-05, m = 0.5

*** Aborted at 1508415432 (unix time) try "date -d @1508415432" if you are using GNU date ***
PC: @     0x7f8922466b8d caffe::CuDNNConvolutionLayer<>::FindExConvAlgo()
*** SIGSEGV (@0x0) received by PID 108568 (TID 0x7f89242d4900) from PID 0; stack trace: ***
    @     0x7f8920263cb0 (unknown)
    @     0x7f8922466b8d caffe::CuDNNConvolutionLayer<>::FindExConvAlgo()
    @     0x7f892248b9f0 caffe::CuDNNConvolutionLayer<>::Reshape()
    @     0x7f89223a7d0b caffe::Layer<>::Forward()
    @     0x7f892261da3b caffe::Net::ForwardFromTo()
    @     0x7f892261db97 caffe::Net::Forward()
    @     0x7f8922620325 caffe::Net::ForwardBackward()
    @     0x7f8922630652 caffe::Solver::Step()
    @     0x7f8922631395 caffe::Solver::Solve()
    @           0x40d9e8 train()
    @           0x40ae18 main
    @     0x7f892024ef45 (unknown)
    @           0x40b6fb (unknown)
    @                0x0 (unknown)
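For triage, the crash output above can be reduced to its abort time and its symbolized frames. The following is a minimal sketch, assuming the glog failure-signal layout shown in this report; the helper names `abort_time` and `symbolized_frames` are hypothetical and not part of Caffe or glog, and the trace is trimmed to a few frames for brevity.

```python
import re
from datetime import datetime, timezone

# Excerpt of the failure output quoted above (trimmed to a few frames).
TRACE = """\
*** Aborted at 1508415432 (unix time) try "date -d @1508415432" if you are using GNU date ***
PC: @     0x7f8922466b8d caffe::CuDNNConvolutionLayer<>::FindExConvAlgo()
*** SIGSEGV (@0x0) received by PID 108568 (TID 0x7f89242d4900) from PID 0; stack trace: ***
    @     0x7f8920263cb0 (unknown)
    @     0x7f8922466b8d caffe::CuDNNConvolutionLayer<>::FindExConvAlgo()
    @     0x7f892248b9f0 caffe::CuDNNConvolutionLayer<>::Reshape()
    @     0x7f89223a7d0b caffe::Layer<>::Forward()
"""

# Assumed line formats, taken from the output in this report.
ABORT_RE = re.compile(r"\*\*\* Aborted at (\d+) \(unix time\)")
FRAME_RE = re.compile(r"^\s*@\s+(0x[0-9a-f]+)\s+(.+?)\s*$")


def abort_time(trace):
    """Return the abort time as a timezone-aware UTC datetime, or None."""
    m = ABORT_RE.search(trace)
    if m is None:
        return None
    return datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)


def symbolized_frames(trace):
    """Return the symbol names of stack frames, skipping '(unknown)' entries."""
    frames = []
    for line in trace.splitlines():
        m = FRAME_RE.match(line)  # the 'PC: @ ...' line does not match
        if m and m.group(2) != "(unknown)":
            frames.append(m.group(2))
    return frames


if __name__ == "__main__":
    print(abort_time(TRACE).isoformat())
    for name in symbolized_frames(TRACE):
        print(name)
```

The first symbolized frame is `caffe::CuDNNConvolutionLayer<>::FindExConvAlgo()`, reached from `Reshape()`. The SIGSEGV address `@0x0` suggests a null pointer dereference inside that function.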

drnikolaev · 2017-10-20T16:52:55Z

Hi @RutenburgIG, thanks.
May I have your prototxt files, please?

drnikolaev · 2017-12-31T06:06:35Z

Fixed in the upcoming release.

drnikolaev closed this as completed Dec 31, 2017
