
Process does not exit when CUDARuntimeError occurs #1222

Closed
andremoeller opened this issue May 4, 2018 · 1 comment

andremoeller commented May 4, 2018

Hi,

I also posted this in the Chainer repository, but I think this issue belongs here instead: chainer/chainer#4709

My problem is: if I misuse MultiprocessParallelUpdater such that an exception is thrown, the Python process does not exit; it just hangs.

The code to reproduce is at the bottom. It raises cupy.cuda.runtime.CUDARuntimeError: cudaErrorInitializationError: initialization error, but the Python process does not exit.

I'd like the process to exit when this happens, but I am not sure how. Is this possible?

Thank you.
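
As a possible stopgap (just a rough sketch of a generic Python watchdog, not something Chainer provides; it assumes the train() entry point from the reproduction code below), the training could be run in a child process that is terminated after a timeout, so the outer process can still exit:

import multiprocessing

def run_with_timeout(target, timeout_sec=300):
    # Run `target` in a child process and terminate it if it has not
    # finished within `timeout_sec`, so the outer process can still exit.
    proc = multiprocessing.Process(target=target)
    proc.start()
    proc.join(timeout_sec)
    if proc.is_alive():
        proc.terminate()  # note: this only kills the immediate child;
        proc.join()       # worker processes it spawned may need separate cleanup
        raise RuntimeError('training did not finish within %d seconds' % timeout_sec)

if __name__ == '__main__':
    # `train` here is assumed to be the function from the reproduction code below
    run_with_timeout(train, timeout_sec=300)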

  • Conditions
    • Chainer version 4.0.0
    • CuPy version 4.0.0
    • OS/Platform Ubuntu 16.04
    • CUDA/cuDNN version 9.0
  • Code to reproduce
import chainer
import chainer.functions as F
import chainer.links as L
from chainer import training

class MLP(chainer.Chain):

    def __init__(self, n_units, n_out):
        super(MLP, self).__init__()
        with self.init_scope():
            # the size of the inputs to each layer will be inferred
            self.l1 = L.Linear(None, n_units)  # n_in -> n_units
            self.l2 = L.Linear(None, n_units)  # n_units -> n_units
            self.l3 = L.Linear(None, n_out)  # n_units -> n_out

    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        return self.l3(h2)

def train():
    train, test = chainer.datasets.get_mnist()

    batch_size = 64
    learning_rate = 0.05

    model = L.Classifier(MLP(1000, 10))

    optimizer = chainer.optimizers.MomentumSGD(learning_rate)
    optimizer.setup(model)
    optimizer.add_hook(chainer.optimizer.WeightDecay(5e-4))

    # Set up a trainer
    num_gpus = 2
    devices = range(num_gpus)

    # this is just to force the error: initialize CUDA in the parent
    # before the updater sets up its worker processes
    chainer.cuda.get_device_from_id(0).use()

    train_iters = [chainer.iterators.MultiprocessIterator(i, batch_size, n_processes=num_gpus)
                   for i in chainer.datasets.split_dataset_n_random(train, len(devices))]

    updater = training.updaters.MultiprocessParallelUpdater(train_iters, optimizer, devices=range(num_gpus))
    updater.setup_workers()


if __name__=="__main__":
    train()

Stacktrace:

algo-1_1  | Process _Worker-1:
algo-1_1  | Traceback (most recent call last):
algo-1_1  |   File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
algo-1_1  |     self.run()
algo-1_1  |   File "/usr/local/lib/python3.5/dist-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 45, in run
algo-1_1  |     dev.use()
algo-1_1  |   File "cupy/cuda/device.pyx", line 101, in cupy.cuda.device.Device.use
algo-1_1  |   File "cupy/cuda/device.pyx", line 107, in cupy.cuda.device.Device.use
algo-1_1  |   File "cupy/cuda/runtime.pyx", line 184, in cupy.cuda.runtime.setDevice
algo-1_1  |   File "cupy/cuda/runtime.pyx", line 136, in cupy.cuda.runtime.check_status
algo-1_1  | cupy.cuda.runtime.CUDARuntimeError: cudaErrorInitializationError: initialization error
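
For context, the failure in the worker looks like the usual fork-after-CUDA-initialization problem: the parent process has already initialized CUDA (the get_device_from_id(0).use() call above), and a forked child generally cannot re-initialize it. A stripped-down sketch of that situation, independent of Chainer (assuming CuPy and at least one GPU are available):

import multiprocessing as mp
import cupy

def worker():
    # The forked child inherits the parent's CUDA state and typically
    # fails to re-initialize it, raising cudaErrorInitializationError.
    cupy.cuda.Device(0).use()
    cupy.zeros(1)

if __name__ == '__main__':
    cupy.cuda.Device(0).use()
    cupy.zeros(1)                    # force CUDA initialization in the parent
    p = mp.Process(target=worker)    # fork happens after CUDA is initialized
    p.start()
    p.join()
    print('worker exitcode:', p.exitcode)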

I previously had a similar problem with mpi4py / chainermn, so the answer might be related: chainer/chainermn#236

@kmaehashi
Member

Closing this issue, as it has been discussed in chainer/chainer#4709 and the fix for Chainer has been merged.
