This repository has been archived by the owner on Nov 21, 2023. It is now read-only.

Various errors when training scales=320 #415

Open
daquexian opened this issue May 5, 2018 · 22 comments

Comments

@daquexian
Contributor

daquexian commented May 5, 2018

Expected results

Training runs correctly at any reasonable scale.

Actual results

Training runs correctly for some iterations, then crashes at a random point. I disabled dataset shuffling by modifying _shuffle_roidb_inds in lib/roi_data/loader.py and tried on VOC twice; the program crashed at a different iteration each time.

What's more, the error messages differ between runs. Sometimes it is

*** Error in `python': double free or corruption (out): 0x00007f42fc228790 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f46092137e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f460921c37a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f460922053c]
/usr/local/lib/python2.7/dist-packages/numpy/core/multiarray.so(+0x1edef)[0x7f4600ca7def]
/usr/local/lib/python2.7/dist-packages/numpy/core/multiarray.so(+0x22032)[0x7f4600cab032]
python(PyEval_EvalFrameEx+0x6162)[0x4ca0d2]
python(PyEval_EvalFrameEx+0x5e0f)[0x4c9d7f]
python(PyEval_EvalCodeEx+0x255)[0x4c2705]
python[0x4de69e]
python(PyObject_Call+0x43)[0x4b0c93]
python[0x4f452e]
python(PyObject_Call+0x43)[0x4b0c93]
python(PyEval_CallObjectWithKeywords+0x30)[0x4ce540]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x83d40)[0x7f45fe003d40]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x854c1)[0x7f45fe0054c1]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x4ca1b)[0x7f45fdfcca1b]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x98dd8)[0x7f45fe018dd8]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x95155)[0x7f45fe015155]
/usr/local/lib/libcaffe2.so(_ZN6caffe26DAGNet5RunAtEiRKSt6vectorIiSaIiEE+0x5a)[0x7f45f5818c5a]
/usr/local/lib/libcaffe2.so(_ZN6caffe210DAGNetBase14WorkerFunctionEv+0x305)[0x7f45f5817a15]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7f4603171c80]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f460956d6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f46092a341d]
======= Memory map: ========
00400000-006e9000 r-xp 00000000 08:05 15099550902                        /usr/bin/python2.7
008e8000-008ea000 r--p 002e8000 08:05 15099550902                        /usr/bin/python2.7
008ea000-00961000 rw-p 002ea000 08:05 15099550902                        /usr/bin/python2.7
00961000-00984000 rw-p 00000000 00:00 0 
02372000-1bcfe000 rw-p 00000000 00:00 0                                  [heap]
200000000-200200000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
200200000-200400000 ---p 00000000 00:00 0 
200400000-200404000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
200404000-200600000 ---p 00000000 00:00 0 
200600000-200a00000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
200a00000-201800000 ---p 00000000 00:00 0 
201800000-201804000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
201804000-201a00000 ---p 00000000 00:00 0 
201a00000-201e00000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
201e00000-202c00000 ---p 00000000 00:00 0 
202c00000-202c04000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
202c04000-202e00000 ---p 00000000 00:00 0 
202e00000-203200000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
203200000-204000000 ---p 00000000 00:00 0 
204000000-204004000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
204004000-204200000 ---p 00000000 00:00 0 
204200000-204600000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
204600000-205400000 ---p 00000000 00:00 0 
205400000-205404000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
205404000-205600000 ---p 00000000 00:00 0 
205600000-205a00000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
205a00000-206800000 ---p 00000000 00:00 0 
206800000-206804000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
206804000-206a00000 ---p 00000000 00:00 0 
206a00000-206e00000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
206e00000-207c00000 ---p 00000000 00:00 0 
207c00000-207c04000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
207c04000-207e00000 ---p 00000000 00:00 0 
207e00000-208200000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
208200000-209000000 ---p 00000000 00:00 0 
209000000-209004000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
209004000-209200000 ---p 00000000 00:00 0 
209200000-209600000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
209600000-20a400000 ---p 00000000 00:00 0 
20a400000-20a404000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
20a404000-20a600000 ---p 00000000 00:00 0 
20a600000-20aa00000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
20aa00000-20aa04000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
20aa04000-20ac00000 ---p 00000000 00:00 0 
20ac00000-20b000000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
20b000000-20b004000 rw-s 00000000 00:05 154858                           /dev/nvidiactl

and sometimes it is

*** Aborted at 1525523656 (unix time) try "date -d @1525523656" if you are using GNU date ***
PC: @     0x7f7c0376048a (unknown)
*** SIGSEGV (@0x0) received by PID 89364 (TID 0x7f79559fb700) from PID 0; stack trace: ***
    @     0x7f7c03abd390 (unknown)
    @     0x7f7c0376048a (unknown)
    @     0x7f7c03763cde (unknown)
    @     0x7f7c03766184 __libc_malloc
    @     0x7f7b7f400a36 (unknown)
    @     0x7f7b7f979634 (unknown)
    @     0x7f7b7fa10d34 (unknown)
    @     0x7f7b7fa131b7 (unknown)
    @     0x7f7b7f409ecb (unknown)
    @     0x7f7b7f40a40c cudnnConvolutionBackwardFilter
    @     0x7f7bc570ac3b _ZN6caffe210CuDNNState7executeIRZNS_19CudnnConvGradientOp13DoRunWithTypeIfffffffEEbvEUlPS0_E1_EEvP11CUstream_stOT_
    @     0x7f7bc571337c caffe2::CudnnConvGradientOp::DoRunWithType<>()
    @     0x7f7bc56fead0 caffe2::CudnnConvGradientOp::RunOnDevice()
    @     0x7f7bc568694b caffe2::Operator<>::Run()
    @     0x7f7bf797ec5a caffe2::DAGNet::RunAt()
    @     0x7f7bf797da15 caffe2::DAGNetBase::WorkerFunction()
    @     0x7f7bfd6b7c80 (unknown)
    @     0x7f7c03ab36ba start_thread
    @     0x7f7c037e941d clone
    @                0x0 (unknown)

Detailed steps to reproduce

In an existing config, modify TRAIN.SCALES to (320,) and TRAIN.MAX_SIZE to 500. Since I was using an FPN config, I also modified FPN.RPN_ANCHOR_START_SIZE to 16 and ROI_CANONICAL_SCALE to 90.
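
For reference, these overrides can be applied on top of an existing config roughly as in the sketch below (assumptions: the config file path is only an example, ROI_CANONICAL_SCALE lives under the FPN section, and depending on the Detectron version the package prefix may be core.config instead of detectron.core.config):

    # Sketch: applying the overrides described above to an existing FPN config.
    from detectron.core.config import (
        cfg, merge_cfg_from_file, merge_cfg_from_list, assert_and_infer_cfg
    )

    merge_cfg_from_file('configs/12_2017_baselines/e2e_faster_rcnn_R-50-FPN_1x.yaml')  # example config
    merge_cfg_from_list([
        'TRAIN.SCALES', '(320,)',
        'TRAIN.MAX_SIZE', '500',
        'FPN.RPN_ANCHOR_START_SIZE', '16',
        'FPN.ROI_CANONICAL_SCALE', '90',
    ])
    assert_and_infer_cfg()
    print(cfg.TRAIN.SCALES, cfg.TRAIN.MAX_SIZE)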

I have tested on COCO and VOC; both fail.

System information

  • Operating system: Ubuntu 16.04
  • Compiler version: gcc 5.4.0
  • CUDA version: 9.1
  • cuDNN version: 7
  • NVIDIA driver version: 387.26
  • GPU models (for all devices if they are not all the same): P40 x 4
  • PYTHONPATH environment variable: null
  • python --version output: Python 2.7.12
@daquexian daquexian changed the title Various errors when training in scales=300 Various errors when training scales=300 May 5, 2018
@v-ilin

v-ilin commented May 6, 2018

I have the same problem with the COCO dataset, and sometimes it happens at test time too, not only during training.

@daquexian
Contributor Author

daquexian commented May 6, 2018

What's more, when I reduced TEST.RPN_PRE_NMS_TOP_N from 1000 to 100, the test sometimes (~1 in 5 times) ran fine, and sometimes models (including my own model and the faster-rcnn-R-50-FPN_1x model from the model zoo) crashed at test time when I ran test_net.py. The error messages also vary.

@daquexian daquexian changed the title Various errors when training scales=300 Various errors when training scales=320 May 6, 2018
@moyans

moyans commented May 15, 2018

I am quite confused about ROI_CANONICAL_SCALE: 90. How can I get this value?

@daquexian
Contributor Author

@moyans You can use grep -rn <detectron directory> -e "ROI_CANONICAL_SCALE" --include=*.py to list all lines containing "ROI_CANONICAL_SCALE" in .py files.

@moyans

moyans commented May 15, 2018

@daquexian Sorry, I didn't explain it clearly. I know where it is. The original value is 224. I'm just wondering how to calculate this number.

@daquexian
Contributor Author

@moyans I calculated it by 224 * 320 / 800.
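
A quick sketch of that proportional scaling (assuming the Detectron defaults TRAIN.SCALES = (800,) and ROI_CANONICAL_SCALE = 224):

    # The canonical ROI scale is rescaled in proportion to the training image scale.
    default_canonical_scale = 224   # default FPN.ROI_CANONICAL_SCALE
    default_train_scale = 800       # default TRAIN.SCALES
    new_train_scale = 320

    new_canonical_scale = default_canonical_scale * new_train_scale / default_train_scale
    print(new_canonical_scale)      # 89.6, rounded to 90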

@pfollmann

pfollmann commented May 16, 2018

Did you find a solution for your problem? I got something similar when I reduced the image scale for my own dataset to 360x480. I think it is related to cython_nms, since if I set TRAIN/TEST.RPN_NMS_THRESH: 0.0 the model trains and evaluates (but with worse results, of course).
Another workaround was to use a larger TRAIN/TEST.SCALE (e.g. 800, with TRAIN/TEST.MAX_SIZE: 1333).

I also tried to debug this, but I still couldn't figure out the problem. Using the CPU NMS (without Cython) from the old py-faster-rcnn repo, I found that sometimes there is a division by zero inside the NMS (when the two boxes both have area 0). That case should be fixed by setting TRAIN/TEST.RPN_MIN_SIZE > 0, but it seems that this is not the only problem.
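
To illustrate the division by zero (a simplified sketch only, not the actual cython_nms code):

    # Simplified IoU computation: when both boxes collapse to zero area
    # (e.g. boxes regressed on a very small feature map), the overlap
    # denominator becomes 0 and the division blows up.
    def iou(a, b):
        # Boxes are (x1, y1, x2, y2); widths/heights deliberately omit the "+ 1"
        # so a collapsed box has exactly zero area.
        area_a = max(a[2] - a[0], 0.0) * max(a[3] - a[1], 0.0)
        area_b = max(b[2] - b[0], 0.0) * max(b[3] - b[1], 0.0)
        inter_w = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
        inter_h = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
        inter = inter_w * inter_h
        return inter / (area_a + area_b - inter)  # 0.0 / 0.0 here

    degenerate = [10.0, 10.0, 10.0, 10.0]  # zero-area box
    iou(degenerate, degenerate)             # raises ZeroDivisionError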

Could you please try to switch off the RPN_NMS (set RPN_NMS_THRESH to 0.0) and see if it works then?

Maybe the problem could also be that the number of anchors/proposals is too small when we apply NMS to regressed boxes generated on very small feature maps (due to the reduced input image size).

@daquexian
Contributor Author

@pfollmann Thanks for the information! I may try it tomorrow. Does the bug still exist even when TRAIN/TEST.RPN_MIN_SIZE > 0?

@pfollmann

Yes, unfortunately even with TRAIN/TEST.RPN_MIN_SIZE > 0 I still got errors at random iterations, in the style you described above.

@pfollmann

pfollmann commented May 17, 2018

I think I found the problem: it is in detectron/utils/cython_nms.pyx:

cdef np.ndarray[np.int_t, ndim=1] order = scores.argsort()[::-1]

The numpy.argsort call seems to be buggy at this point (no clue why). I replaced it with the Cython argsort implementation from https://github.com/jcrudy/cython-argsort/blob/master/cyargsort/argsort.pyx. To make it work, the following changes are necessary:

  • Place argsort.pyx in detectron/utils
  • Change line 13 in argsort.pyx to
    ctypedef cnp.float32_t FLOAT_t
  • register the file in setup.py (similar to cython_nms.pyx and cython_bbox.pyx; see the sketch after this list)
  • include it in detectron/utils/cython_nms.pyx, i.e. change the file as follows:
    import utils.argsort as argsort
    ...
    cdef np.ndarray[np.int_t, ndim=1] order = np.empty((ndets), dtype=np.intp)
    argsort.argsort(-scores, order)
  • and run make in detectron to compile the Cython modules again.
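
A rough sketch of the setup.py registration mentioned above (the exact Extension arguments used in Detectron's setup.py may differ, so treat this only as an outline):

    # Sketch: registering utils/argsort.pyx alongside the existing Cython modules.
    import numpy as np
    from distutils.core import setup
    from distutils.extension import Extension
    from Cython.Build import cythonize

    ext_modules = [
        Extension(
            name='utils.argsort',
            sources=['utils/argsort.pyx'],
            include_dirs=[np.get_include()],
        ),
    ]

    setup(name='detectron-argsort-ext', ext_modules=cythonize(ext_modules))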

For me, training now runs fine for 20k iterations, and inference no longer seg-faults (with a little speed-up as a bonus ;-) )

The open question is still why cython_nms.pyx worked fine for other settings of TRAIN/TEST.SCALE. In my experience the problem was not the image scale itself but the object sizes, which become very small when images are rescaled to small sizes.

Hope that helps!

PS: My current Detectron version has diverged quite far from master, so I'm not sure I'll find time to open a PR soon.

@daquexian
Contributor Author

daquexian commented May 17, 2018 via email

@daquexian
Contributor Author

daquexian commented May 17, 2018

@pfollmann It works! Thanks! I'd like to keep this issue open because the patch has not been merged into master.

Looking forward to your PR :) You might want to fetch master and modify the corresponding files; the steps on master are no different from those you pointed out above.

@lzhbrian

@pfollmann Thanks, you saved my day!

@shenghsiaowong

Hi, I hit this problem when using the above method. Do you have any ideas how to solve it? Thank you very much.

import detectron.utils.cython_nms as cython_nms
File "detectron/utils/cython_nms.pyx", line 27, in init detectron.utils.cython_nms
import utils.argsort as argsort
ImportError: No module named utils.argsort

@daquexian
Contributor Author

daquexian commented Sep 3, 2018

@shenghsiaowong I think you should use import detectron.utils.argsort as argsort because the project structure changed after @pfollmann posted his solution.

@shenghsiaowong

I have changed it, but it does not work. I know this is a small issue, but I have no idea what is wrong.
File "detectron/utils/cython_nms.pyx", line 27, in init detectron.utils.cython_nms
#import detectron.utils.argsort as argsort
ImportError: No module named utils.argsort

@shenghsiaowong

What is the meaning of this? Thank you.
import detectron.utils.cython_nms as cython_nms
File "detectron/utils/cython_nms.pyx", line 28, in init detectron.utils.cython_nms
import detectron.utils.argsort as argsort
ImportError: dynamic module does not define init function (initargsort)

@daquexian
Contributor Author

daquexian commented Sep 16, 2018

@shenghsiaowong Sorry, I haven't run into this error. @pfollmann, do you have time to send a PR for your excellent solution so that every user can benefit from it seamlessly? :)

@karenyun

karenyun commented Oct 4, 2018

@pfollmann @daquexian Hi, thanks for the solution, but when I tried your advice another error happened: Error in 'python': free() invalid next size (fast)
Could you give any advice about it?

@SHERLOCKLS

SHERLOCKLS commented Oct 29, 2018

@shenghsiaowong
try pasting "import detectron.utils.argsort as argsort" after these lines:
cimport cython
import numpy as np
cimport numpy as np
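
For example (just a sketch; license header and other imports omitted), the top of detectron/utils/cython_nms.pyx would then look roughly like:

    cimport cython
    import numpy as np
    cimport numpy as np

    import detectron.utils.argsort as argsort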

@StepOITD

@karenyun I am hitting the same problem here; did you figure it out?

@CSdidi

CSdidi commented Nov 28, 2018

@karenyun @StepOITD I met the same problem and solved it as follows:
(1) change the file detectron/utils/cython_nms.pyx as pfollmann suggested;
(2) put the new lines after the definition of ndets. In other words, change the original code snippet:

    cdef np.ndarray[np.int_t, ndim=1] order = scores.argsort()[::-1]

    cdef int ndets = dets.shape[0]
    cdef np.ndarray[np.int_t, ndim=1] suppressed = \
            np.zeros((ndets), dtype=np.int)

to

    cdef int ndets = dets.shape[0]
    cdef np.ndarray[np.int_t, ndim=1] suppressed = \
            np.zeros((ndets), dtype=np.int)
    
    cdef np.ndarray[np.int_t, ndim=1] order = np.empty((ndets), dtype=np.intp)
    argsort.argsort(-scores, order) 

Hope it helps.
