This repository has been archived by the owner on Nov 21, 2023. It is now read-only.

Various errors when training scales=320 #415

Open
daquexian opened this issue May 5, 2018 · 22 comments

Comments

@daquexian
Contributor

daquexian commented May 5, 2018

Expected results

Training runs correctly at any reasonable scale.

Actual results

Training runs correctly for some iterations, then crashes at a random point. I disabled dataset shuffling by modifying _shuffle_roidb_inds in lib/roi_data/loader.py and tried on VOC twice; the program crashed at a different iteration each time.

What's more, the error messages differ between runs. Sometimes it is

*** Error in `python': double free or corruption (out): 0x00007f42fc228790 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f46092137e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f460921c37a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f460922053c]
/usr/local/lib/python2.7/dist-packages/numpy/core/multiarray.so(+0x1edef)[0x7f4600ca7def]
/usr/local/lib/python2.7/dist-packages/numpy/core/multiarray.so(+0x22032)[0x7f4600cab032]
python(PyEval_EvalFrameEx+0x6162)[0x4ca0d2]
python(PyEval_EvalFrameEx+0x5e0f)[0x4c9d7f]
python(PyEval_EvalCodeEx+0x255)[0x4c2705]
python[0x4de69e]
python(PyObject_Call+0x43)[0x4b0c93]
python[0x4f452e]
python(PyObject_Call+0x43)[0x4b0c93]
python(PyEval_CallObjectWithKeywords+0x30)[0x4ce540]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x83d40)[0x7f45fe003d40]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x854c1)[0x7f45fe0054c1]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x4ca1b)[0x7f45fdfcca1b]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x98dd8)[0x7f45fe018dd8]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x95155)[0x7f45fe015155]
/usr/local/lib/libcaffe2.so(_ZN6caffe26DAGNet5RunAtEiRKSt6vectorIiSaIiEE+0x5a)[0x7f45f5818c5a]
/usr/local/lib/libcaffe2.so(_ZN6caffe210DAGNetBase14WorkerFunctionEv+0x305)[0x7f45f5817a15]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7f4603171c80]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f460956d6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f46092a341d]
======= Memory map: ========
00400000-006e9000 r-xp 00000000 08:05 15099550902                        /usr/bin/python2.7
008e8000-008ea000 r--p 002e8000 08:05 15099550902                        /usr/bin/python2.7
008ea000-00961000 rw-p 002ea000 08:05 15099550902                        /usr/bin/python2.7
00961000-00984000 rw-p 00000000 00:00 0 
02372000-1bcfe000 rw-p 00000000 00:00 0                                  [heap]
200000000-200200000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
200200000-200400000 ---p 00000000 00:00 0 
200400000-200404000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
200404000-200600000 ---p 00000000 00:00 0 
200600000-200a00000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
200a00000-201800000 ---p 00000000 00:00 0 
201800000-201804000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
201804000-201a00000 ---p 00000000 00:00 0 
201a00000-201e00000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
201e00000-202c00000 ---p 00000000 00:00 0 
202c00000-202c04000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
202c04000-202e00000 ---p 00000000 00:00 0 
202e00000-203200000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
203200000-204000000 ---p 00000000 00:00 0 
204000000-204004000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
204004000-204200000 ---p 00000000 00:00 0 
204200000-204600000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
204600000-205400000 ---p 00000000 00:00 0 
205400000-205404000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
205404000-205600000 ---p 00000000 00:00 0 
205600000-205a00000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
205a00000-206800000 ---p 00000000 00:00 0 
206800000-206804000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
206804000-206a00000 ---p 00000000 00:00 0 
206a00000-206e00000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
206e00000-207c00000 ---p 00000000 00:00 0 
207c00000-207c04000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
207c04000-207e00000 ---p 00000000 00:00 0 
207e00000-208200000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
208200000-209000000 ---p 00000000 00:00 0 
209000000-209004000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
209004000-209200000 ---p 00000000 00:00 0 
209200000-209600000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
209600000-20a400000 ---p 00000000 00:00 0 
20a400000-20a404000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
20a404000-20a600000 ---p 00000000 00:00 0 
20a600000-20aa00000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
20aa00000-20aa04000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
20aa04000-20ac00000 ---p 00000000 00:00 0 
20ac00000-20b000000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
20b000000-20b004000 rw-s 00000000 00:05 154858                           /dev/nvidiactl

and sometimes it is

*** Aborted at 1525523656 (unix time) try "date -d @1525523656" if you are using GNU date ***
PC: @     0x7f7c0376048a (unknown)
*** SIGSEGV (@0x0) received by PID 89364 (TID 0x7f79559fb700) from PID 0; stack trace: ***
    @     0x7f7c03abd390 (unknown)
    @     0x7f7c0376048a (unknown)
    @     0x7f7c03763cde (unknown)
    @     0x7f7c03766184 __libc_malloc
    @     0x7f7b7f400a36 (unknown)
    @     0x7f7b7f979634 (unknown)
    @     0x7f7b7fa10d34 (unknown)
    @     0x7f7b7fa131b7 (unknown)
    @     0x7f7b7f409ecb (unknown)
    @     0x7f7b7f40a40c cudnnConvolutionBackwardFilter
    @     0x7f7bc570ac3b _ZN6caffe210CuDNNState7executeIRZNS_19CudnnConvGradientOp13DoRunWithTypeIfffffffEEbvEUlPS0_E1_EEvP11CUstream_stOT_
    @     0x7f7bc571337c caffe2::CudnnConvGradientOp::DoRunWithType<>()
    @     0x7f7bc56fead0 caffe2::CudnnConvGradientOp::RunOnDevice()
    @     0x7f7bc568694b caffe2::Operator<>::Run()
    @     0x7f7bf797ec5a caffe2::DAGNet::RunAt()
    @     0x7f7bf797da15 caffe2::DAGNetBase::WorkerFunction()
    @     0x7f7bfd6b7c80 (unknown)
    @     0x7f7c03ab36ba start_thread
    @     0x7f7c037e941d clone
    @                0x0 (unknown)

Detailed steps to reproduce

In an existing config, modify TRAIN.SCALES to (320,) and TRAIN.MAX_SIZE to 500. Since I was using an FPN config, I also modified FPN.RPN_ANCHOR_START_SIZE to 16 and ROI_CANONICAL_SCALE to 90.
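
For reference, these overrides can be applied on top of an existing config roughly as in the sketch below (assumptions: the config file path is only an example, ROI_CANONICAL_SCALE lives under the FPN section, and depending on the Detectron version the package prefix may be core.config instead of detectron.core.config):

    # Sketch: applying the overrides described above to an existing FPN config.
    from detectron.core.config import (
        cfg, merge_cfg_from_file, merge_cfg_from_list, assert_and_infer_cfg
    )

    merge_cfg_from_file('configs/12_2017_baselines/e2e_faster_rcnn_R-50-FPN_1x.yaml')  # example config
    merge_cfg_from_list([
        'TRAIN.SCALES', '(320,)',
        'TRAIN.MAX_SIZE', '500',
        'FPN.RPN_ANCHOR_START_SIZE', '16',
        'FPN.ROI_CANONICAL_SCALE', '90',
    ])
    assert_and_infer_cfg()
    print(cfg.TRAIN.SCALES, cfg.TRAIN.MAX_SIZE)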

I have tested on COCO and VOC; both fail.

System information

  • Operating system: Ubuntu 16.04
  • Compiler version: gcc 5.4.0
  • CUDA version: 9.1
  • cuDNN version: 7
  • NVIDIA driver version: 387.26
  • GPU models (for all devices if they are not all the same): P40 x 4
  • PYTHONPATH environment variable: null
  • python --version output: Python 2.7.12
@daquexian daquexian changed the title Various errors when training in scales=300 Various errors when training scales=300 May 5, 2018
@v-ilin

v-ilin commented May 6, 2018

I have the same problem with the COCO dataset, and sometimes it happens at test time too, not only during training.

@daquexian
Contributor Author

daquexian commented May 6, 2018

What's more, when I reduced TEST.RPN_PRE_NMS_TOP_N from 1000 to 100, the test sometimes (~1 in 5 times) ran fine, and sometimes models (including my own model and the faster-rcnn-R-50-FPN_1x model from the model zoo) crashed at test time when I ran test_net.py. The error messages also vary.

@daquexian daquexian changed the title Various errors when training scales=300 Various errors when training scales=320 May 6, 2018
@moyans

moyans commented May 15, 2018

I am quite confused about ROI_CANONICAL_SCALE: 90. How can I get this value?

@daquexian
Contributor Author

@moyans You can use grep -rn <detectron directory> -e "ROI_CANONICAL_SCALE" --include=*.py to list all lines containing "ROI_CANONICAL_SCALE" in .py files.

@moyans

moyans commented May 15, 2018

@daquexian Sorry, I didn't explain it clearly. I know where it is. The original value is 224. I'm just wondering how to calculate this number.

@daquexian
Contributor Author

@moyans I calculated it by 224 * 320 / 800.
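
A quick sketch of that proportional scaling (assuming the Detectron defaults TRAIN.SCALES = (800,) and ROI_CANONICAL_SCALE = 224):

    # The canonical ROI scale is rescaled in proportion to the training image scale.
    default_canonical_scale = 224   # default FPN.ROI_CANONICAL_SCALE
    default_train_scale = 800       # default TRAIN.SCALES
    new_train_scale = 320

    new_canonical_scale = default_canonical_scale * new_train_scale / default_train_scale
    print(new_canonical_scale)      # 89.6, rounded to 90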

@pfollmann

pfollmann commented May 16, 2018

Did you find a solution for your problem? I got something similar when I reduced the image scale for my own dataset to 360x480. I think it is related to cython_nms, since if I set TRAIN/TEST.RPN_NMS_THRESH: 0.0 the model trains and evaluates (but with worse results, of course).
Another workaround was to use a larger TRAIN/TEST.SCALE (e.g. 800, with TRAIN/TEST.MAX_SIZE: 1333).

I also tried to debug this, but I still couldn't figure out the problem. Using the CPU NMS (without Cython) from the old py-faster-rcnn repo, I found that sometimes there is a division by zero inside the NMS (when the two boxes both have area 0). That case should be fixed by setting TRAIN/TEST.RPN_MIN_SIZE > 0, but it seems that this is not the only problem.
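
To illustrate the division by zero (a simplified sketch only, not the actual cython_nms code):

    # Simplified IoU computation: when both boxes collapse to zero area
    # (e.g. boxes regressed on a very small feature map), the overlap
    # denominator becomes 0 and the division blows up.
    def iou(a, b):
        # Boxes are (x1, y1, x2, y2); widths/heights deliberately omit the "+ 1"
        # so a collapsed box has exactly zero area.
        area_a = max(a[2] - a[0], 0.0) * max(a[3] - a[1], 0.0)
        area_b = max(b[2] - b[0], 0.0) * max(b[3] - b[1], 0.0)
        inter_w = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
        inter_h = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
        inter = inter_w * inter_h
        return inter / (area_a + area_b - inter)  # 0.0 / 0.0 here

    degenerate = [10.0, 10.0, 10.0, 10.0]  # zero-area box
    iou(degenerate, degenerate)             # raises ZeroDivisionError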

Could you please try to switch off the RPN_NMS (set RPN_NMS_THRESH to 0.0) and see if it works then?

Maybe the problem could also be that the number of anchors/proposals is too small when we apply NMS to regressed boxes generated on very small feature maps (due to the reduced input image size).

@daquexian
Contributor Author

@pfollmann Thanks for the information! I may try it tomorrow. Does the bug still exist even when TRAIN/TEST.RPN_MIN_SIZE > 0?

@pfollmann

Yes, unfortunately even with TRAIN/TEST.RPN_MIN_SIZE > 0 I still got errors at random iterations, in the style you described above.

@pfollmann

pfollmann commented May 17, 2018

I think I found the problem: it is in detectron/utils/cython_nms.pyx:

cdef np.ndarray[np.int_t, ndim=1] order = scores.argsort()[::-1]

The numpy.argsort call seems to be buggy at this point (no clue why). I replaced it with the Cython argsort implementation from https://github.com/jcrudy/cython-argsort/blob/master/cyargsort/argsort.pyx. To make it work, the following changes are necessary:

  • Place argsort.pyx in detectron/utils
  • Change line 13 in argsort.pyx to
    ctypedef cnp.float32_t FLOAT_t
  • register the file in setup.py (similar to cython_nms.pyx and cython_bbox.pyx; see the sketch after this list)
  • include it in detectron/utils/cython_nms.pyx, i.e. change the file as follows:
    import utils.argsort as argsort
    ...
    cdef np.ndarray[np.int_t, ndim=1] order = np.empty((ndets), dtype=np.intp)
    argsort.argsort(-scores, order)
  • and run make in detectron to compile the Cython modules again.
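
A rough sketch of the setup.py registration mentioned above (the exact Extension arguments used in Detectron's setup.py may differ, so treat this only as an outline):

    # Sketch: registering utils/argsort.pyx alongside the existing Cython modules.
    import numpy as np
    from distutils.core import setup
    from distutils.extension import Extension
    from Cython.Build import cythonize

    ext_modules = [
        Extension(
            name='utils.argsort',
            sources=['utils/argsort.pyx'],
            include_dirs=[np.get_include()],
        ),
    ]

    setup(name='detectron-argsort-ext', ext_modules=cythonize(ext_modules))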

For me, training now runs fine for 20k iterations, and inference no longer seg-faults (with a little speed-up as a bonus ;-) )

The open question is still why cython_nms.pyx worked fine for other settings of TRAIN/TEST.SCALE. In my experience the problem was not the image scale itself but the object sizes, which become very small when images are rescaled to small sizes.

Hope that helps!

PS: My current Detectron version has diverged quite far from master, so I'm not sure I'll find time to open a PR soon.

@daquexian
Contributor Author

daquexian commented May 17, 2018 via email

@daquexian
Contributor Author

daquexian commented May 17, 2018

@pfollmann It works! Thanks! I'd like to keep this issue open because the patch has not been merged into master.

Looking forward to your PR :) You might want to fetch master and modify the corresponding files; the steps on master are no different from those you pointed out above.

@lzhbrian

@pfollmann Thanks, you saved my day!

@shenghsiaowong

Hi, I hit this problem when using the above method. Do you have any ideas how to solve it? Thank you very much.

import detectron.utils.cython_nms as cython_nms
File "detectron/utils/cython_nms.pyx", line 27, in init detectron.utils.cython_nms
import utils.argsort as argsort
ImportError: No module named utils.argsort

@daquexian
Contributor Author

daquexian commented Sep 3, 2018

@shenghsiaowong I think you should use import detectron.utils.argsort as argsort because the project structure changed after @pfollmann posted his solution.

@shenghsiaowong

I have changed it, but it does not work. I know this is a small issue, but I have no idea what is wrong.
File "detectron/utils/cython_nms.pyx", line 27, in init detectron.utils.cython_nms
#import detectron.utils.argsort as argsort
ImportError: No module named utils.argsort

@shenghsiaowong

What is the meaning of this? Thank you.
import detectron.utils.cython_nms as cython_nms
File "detectron/utils/cython_nms.pyx", line 28, in init detectron.utils.cython_nms
import detectron.utils.argsort as argsort
ImportError: dynamic module does not define init function (initargsort)

@daquexian
Contributor Author

daquexian commented Sep 16, 2018

@shenghsiaowong Sorry, I haven't run into this error. @pfollmann, do you have time to send a PR for your excellent solution so that every user can benefit from it seamlessly? :)

@karenyun

karenyun commented Oct 4, 2018

@pfollmann @daquexian Hi, thanks for the solution, but when I tried your advice another error happened: Error in 'python': free() invalid next size (fast)
Could you give any advice about it?

@SHERLOCKLS

SHERLOCKLS commented Oct 29, 2018

@shenghsiaowong
try pasting "import detectron.utils.argsort as argsort" after these lines:
cimport cython
import numpy as np
cimport numpy as np
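
For example (just a sketch; license header and other imports omitted), the top of detectron/utils/cython_nms.pyx would then look roughly like:

    cimport cython
    import numpy as np
    cimport numpy as np

    import detectron.utils.argsort as argsort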

@StepOITD

@karenyun I am hitting the same problem here; did you figure it out?

@CSdidi

CSdidi commented Nov 28, 2018

@karenyun @StepOITD I met the same problem and solved it as follows:
(1) change the file detectron/utils/cython_nms.pyx as pfollmann suggested;
(2) put the new lines after the definition of ndets. In other words, change the original code snippet:

    cdef np.ndarray[np.int_t, ndim=1] order = scores.argsort()[::-1]

    cdef int ndets = dets.shape[0]
    cdef np.ndarray[np.int_t, ndim=1] suppressed = \
            np.zeros((ndets), dtype=np.int)

to

    cdef int ndets = dets.shape[0]
    cdef np.ndarray[np.int_t, ndim=1] suppressed = \
            np.zeros((ndets), dtype=np.int)
    
    cdef np.ndarray[np.int_t, ndim=1] order = np.empty((ndets), dtype=np.intp)
    argsort.argsort(-scores, order) 

Hope it helps.
