
Process does not exit when CUDARuntimeError occurs #1222

Closed
andremoeller opened this issue May 4, 2018 · 1 comment

andremoeller commented May 4, 2018

Hi,

I also posted this in the Chainer repository, but I think this issue belongs here instead: chainer/chainer#4709

My problem is: if I misuse MultiprocessParallelUpdater such that an exception is thrown, the Python process does not exit; it just hangs.

The code to reproduce is at the bottom. It raises cupy.cuda.runtime.CUDARuntimeError: cudaErrorInitializationError: initialization error, but the Python process does not exit.

I'd like the process to exit when this happens, but I am not sure how. Is this possible?

Thank you.
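
As a possible stopgap (just a rough sketch of a generic Python watchdog, not something Chainer provides; it assumes the train() entry point from the reproduction code below), the training could be run in a child process that is terminated after a timeout, so the outer process can still exit:

import multiprocessing

def run_with_timeout(target, timeout_sec=300):
    # Run `target` in a child process and terminate it if it has not
    # finished within `timeout_sec`, so the outer process can still exit.
    proc = multiprocessing.Process(target=target)
    proc.start()
    proc.join(timeout_sec)
    if proc.is_alive():
        proc.terminate()  # note: this only kills the immediate child;
        proc.join()       # worker processes it spawned may need separate cleanup
        raise RuntimeError('training did not finish within %d seconds' % timeout_sec)

if __name__ == '__main__':
    # `train` here is assumed to be the function from the reproduction code below
    run_with_timeout(train, timeout_sec=300)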

  • Conditions
    • Chainer version 4.0.0
    • CuPy version 4.0.0
    • OS/Platform Ubuntu 16.04
    • CUDA/cuDNN version 9.0
  • Code to reproduce
import chainer
import chainer.functions as F
import chainer.links as L
from chainer import training

class MLP(chainer.Chain):

    def __init__(self, n_units, n_out):
        super(MLP, self).__init__()
        with self.init_scope():
            # the size of the inputs to each layer will be inferred
            self.l1 = L.Linear(None, n_units)  # n_in -> n_units
            self.l2 = L.Linear(None, n_units)  # n_units -> n_units
            self.l3 = L.Linear(None, n_out)  # n_units -> n_out

    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        return self.l3(h2)

def train():
    train, test = chainer.datasets.get_mnist()

    batch_size = 64
    learning_rate = 0.05

    model = L.Classifier(MLP(1000, 10))

    optimizer = chainer.optimizers.MomentumSGD(learning_rate)
    optimizer.setup(model)
    optimizer.add_hook(chainer.optimizer.WeightDecay(5e-4))

    # Set up a trainer
    num_gpus = 2
    devices = range(num_gpus)

    # this is just to force the error: initialize CUDA in the parent
    # before the updater sets up its worker processes
    chainer.cuda.get_device_from_id(0).use()

    train_iters = [chainer.iterators.MultiprocessIterator(i, batch_size, n_processes=num_gpus)
                   for i in chainer.datasets.split_dataset_n_random(train, len(devices))]

    updater = training.updaters.MultiprocessParallelUpdater(train_iters, optimizer, devices=range(num_gpus))
    updater.setup_workers()


if __name__=="__main__":
    train()

Stacktrace:

algo-1_1  | Process _Worker-1:
algo-1_1  | Traceback (most recent call last):
algo-1_1  |   File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
algo-1_1  |     self.run()
algo-1_1  |   File "/usr/local/lib/python3.5/dist-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 45, in run
algo-1_1  |     dev.use()
algo-1_1  |   File "cupy/cuda/device.pyx", line 101, in cupy.cuda.device.Device.use
algo-1_1  |   File "cupy/cuda/device.pyx", line 107, in cupy.cuda.device.Device.use
algo-1_1  |   File "cupy/cuda/runtime.pyx", line 184, in cupy.cuda.runtime.setDevice
algo-1_1  |   File "cupy/cuda/runtime.pyx", line 136, in cupy.cuda.runtime.check_status
algo-1_1  | cupy.cuda.runtime.CUDARuntimeError: cudaErrorInitializationError: initialization error
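
For context, the failure in the worker looks like the usual fork-after-CUDA-initialization problem: the parent process has already initialized CUDA (the get_device_from_id(0).use() call above), and a forked child generally cannot re-initialize it. A stripped-down sketch of that situation, independent of Chainer (assuming CuPy and at least one GPU are available):

import multiprocessing as mp
import cupy

def worker():
    # The forked child inherits the parent's CUDA state and typically
    # fails to re-initialize it, raising cudaErrorInitializationError.
    cupy.cuda.Device(0).use()
    cupy.zeros(1)

if __name__ == '__main__':
    cupy.cuda.Device(0).use()
    cupy.zeros(1)                    # force CUDA initialization in the parent
    p = mp.Process(target=worker)    # fork happens after CUDA is initialized
    p.start()
    p.join()
    print('worker exitcode:', p.exitcode)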

I previously had a similar problem with mpi4py / chainermn, so the answer might be related: chainer/chainermn#236

@kmaehashi
Member

Closing this issue, as it has been discussed in chainer/chainer#4709 and the fix for Chainer has been merged.
