New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A training loop got stuck in a certain condition with multi-processing updater and opencv. #2903

Open
apple2373 opened this Issue Jun 22, 2017 · 5 comments

Comments

Projects
None yet
4 participants
@apple2373
Copy link
Contributor

apple2373 commented Jun 22, 2017

A training loop got stuck in a certain condition with multi-processing updater and opencv. The issue does not appear when I use Pillow or serial updater.

  • Conditions
    • Chainer version: 2.0
    • CuPy version: 1.0.0.1
    • OS/Platform Ubuntu 14.04.5 (for PFN ppl sakura server 1)
    • CUDA/cuDNN version: V8.0.44, I don't know the way to check CUDNN version....
  • Code to reproduce
    https://github.com/apple2373/chainer-train-stuck
    Note that this requires other libraries such as chainercv, open cv, etc....
    then
    python train.py --gpu 0 --mode 0
  • Error messages, stack traces, or logs
    No message when stuck but the I can get the following message when I abort it.

stsutsui@sakura1:/mnt/sakura201/stsutsui/chainer-train-stuck$ python train.py --gpu 0 --mode 0
epoch iteration elapsed_time main/loss main/accuracy
0 1 7.3559 1.07867 0.498882
^CProcess Process-8:..........................................] 1.00%
Process Process-7:###########.................................] 35.29%
Traceback (most recent call last):rations
Traceback (most recent call last):me to finish: 0:00:00.
File "/mnt/sakura201/stsutsui/anadonda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
File "/mnt/sakura201/stsutsui/anadonda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/mnt/sakura201/stsutsui/anadonda2/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/mnt/sakura201/stsutsui/anadonda2/lib/python2.7/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
self.run()
File "/mnt/sakura201/stsutsui/anadonda2/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/mnt/sakura201/stsutsui/anadonda2/lib/python2.7/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
cnt, mem_index, index = in_queue.get()
File "/mnt/sakura201/stsutsui/anadonda2/lib/python2.7/multiprocessing/queues.py", line 115, in get
File "/mnt/sakura201/stsutsui/anadonda2/lib/python2.7/multiprocessing/queues.py", line 117, in get
res = self._recv()
self._rlock.acquire()
KeyboardInterrupt
KeyboardInterrupt

@pksorensen

This comment has been minimized.

Copy link

pksorensen commented Aug 15, 2017

I experienced similar issue

chainer@5a84acfb0706:/src/image-labelling-tool/examples/ssd$ python train.py --train /data --label_names /data/apple_orange_label_names.yml --out /data/logs --val_iteration 1
^CProcess Process-2:
Process Process-9:
Process Process-1:
Process Process-5:
Process Process-3:
Process Process-4:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
    cnt, mem_index, index = in_queue.get()
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
    cnt, mem_index, index = in_queue.get()
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 94, in get
    res = self._recv_bytes()
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
    cnt, mem_index, index = in_queue.get()
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
    cnt, mem_index, index = in_queue.get()
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
    cnt, mem_index, index = in_queue.get()
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
Traceback (most recent call last):
KeyboardInterrupt
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
    cnt, mem_index, index = in_queue.get()
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Process Process-10:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
    cnt, mem_index, index = in_queue.get()
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Process Process-11:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
    cnt, mem_index, index = in_queue.get()
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Process Process-12:
Process Process-13:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
    cnt, mem_index, index = in_queue.get()
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
    cnt, mem_index, index = in_queue.get()
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Process Process-14:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
    cnt, mem_index, index = in_queue.get()
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Process Process-17:
Process Process-20:
Process Process-19:
Process Process-18:
Process Process-15:
Process Process-16:
Process Process-8:
Process Process-6:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
    cnt, mem_index, index = in_queue.get()
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
Process Process-7:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
    cnt, mem_index, index = in_queue.get()
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
    cnt, mem_index, index = in_queue.get()
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
    cnt, mem_index, index = in_queue.get()
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
    cnt, mem_index, index = in_queue.get()
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
    cnt, mem_index, index = in_queue.get()
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
    cnt, mem_index, index = in_queue.get()
  File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
KeyboardInterrupt
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
    cnt, mem_index, index = in_queue.get()
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
    cnt, mem_index, index = in_queue.get()
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "train.py", line 70, in <module>
    args.resume)
  File "/src/image-labelling-tool/examples/ssd/train_utils.py", line 167, in train
    trainer.run()
  File "/opt/conda/lib/python3.6/site-packages/chainer/training/trainer.py", line 299, in run
    entry.extension(self)
  File "/opt/conda/lib/python3.6/site-packages/chainer/training/extensions/evaluator.py", line 137, in __call__
    result = self.evaluate()
  File "/opt/conda/lib/python3.6/site-packages/chainercv/extensions/evaluator/detection_voc_evaluator.py", line 76, in evaluate
    target.predict, it)
  File "/opt/conda/lib/python3.6/site-packages/chainercv/utils/iterator/apply_prediction_to_iterator.py", line 93, in apply_prediction_to_iterator
    _apply(predict, iterator, hook))
  File "/opt/conda/lib/python3.6/site-packages/chainercv/utils/iterator/unzip.py", line 87, in unzip
    values = next(iterator)
  File "/opt/conda/lib/python3.6/site-packages/chainercv/utils/iterator/apply_prediction_to_iterator.py", line 129, in _apply
    pred_values = predict(imgs)
  File "/opt/conda/lib/python3.6/site-packages/chainercv/links/model/ssd/ssd.py", line 204, in predict
    mb_locs, mb_confs = self(x)
  File "/opt/conda/lib/python3.6/site-packages/chainercv/links/model/ssd/ssd.py", line 130, in __call__
    return self.multibox(self.extractor(x))
  File "/opt/conda/lib/python3.6/site-packages/chainercv/links/model/ssd/ssd_vgg16.py", line 153, in __call__
    ys = super(VGG16Extractor300, self).__call__(x)
  File "/opt/conda/lib/python3.6/site-packages/chainercv/links/model/ssd/ssd_vgg16.py", line 91, in __call__
    h = F.relu(self.conv4_2(h))
  File "/opt/conda/lib/python3.6/site-packages/chainer/links/connection/convolution_2d.py", line 154, in __call__
    x, self.W, self.b, self.stride, self.pad)
  File "/opt/conda/lib/python3.6/site-packages/chainer/functions/connection/convolution_2d.py", line 439, in convolution_2d
    return func(x, W, b)
  File "/opt/conda/lib/python3.6/site-packages/chainer/function.py", line 200, in __call__
    outputs = self.forward(in_data)
  File "/opt/conda/lib/python3.6/site-packages/chainer/function.py", line 329, in forward
    return self.forward_cpu(inputs)
  File "/opt/conda/lib/python3.6/site-packages/chainer/functions/connection/convolution_2d.py", line 89, in forward_cpu
    self.col, W, ((1, 2, 3), (1, 2, 3))).astype(x.dtype, copy=False)
  File "/opt/conda/lib/python3.6/site-packages/numpy/core/numeric.py", line 1337, in tensordot
    at = a.transpose(newaxes_a).reshape(newshape_a)
KeyboardInterrupt

@mitmul mitmul assigned mitmul and unassigned niboshi Dec 18, 2017

@mitmul

This comment has been minimized.

Copy link
Member

mitmul commented Dec 18, 2017

Thank you for reporting it. I will investigate this problem with the latest stable version of Chainer and the dev version of Chainer (v3.1.0 and v4.0.0b1).

@mitmul

This comment has been minimized.

Copy link
Member

mitmul commented Jan 29, 2018

@apple2373
I reproduced it with this environment setting: https://gist.github.com/mitmul/2cd98788c07ebbae1815232a32f95728
You can trace what I saw in the same environment built with the Dockerfile.

And I also find a workaround. Just put OMP_NUM_THREADS=1 before the execution of Python just solves the problem:

OMP_NUM_THREADS=1 python train.py --gpu 0 --mode 0

This progresses the training without stucking. This is actually related to OpenCV's imread method. Because if I replace the get_example method of SegDataset with an alternative one just returns a numpy array created inside of the method, it processed without stucking.

Well, another workaround is to set cv.setNumThreads(0) right after the import cv2 as cv in the source code.

@apple2373

This comment has been minimized.

Copy link
Contributor

apple2373 commented Jan 29, 2018

Thanks! I was also able to fix the issue by either OMP_NUM_THREADS=1 or cv.setNumThreads(0). I wonder this makes the open cv slower because we are prohibiting the multi processing, but that's fine for me. A more practical problem is , it's really hard to reach this solution for novice users (including me) so it might be better to do cv.setNumThreads(0) in chainer internally?

@stale

This comment has been minimized.

Copy link

stale bot commented Dec 18, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 18, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment