#898 causes cuda initialization error #1054

Closed
szhengac opened this issue Sep 25, 2020 · 12 comments
Labels
bug Something isn't working

Comments

@szhengac
Contributor

szhengac commented Sep 25, 2020

When I submitted a SageMaker job with the latest gluonts master, I received a CUDA initialization error at https://github.com/awslabs/gluon-ts/blob/3ff95ebcf70e39f6424d76b9b58bd861b36f7363/src/gluonts/mx/trainer/_base.py#L242. After reverting #898, everything works fine.

@szhengac added the bug label on Sep 25, 2020
@lostella changed the title from "#898 causes cuda initialization error in internal repo" to "#898 causes cuda initialization error" on Sep 25, 2020
@lostella
Contributor

@szhengac do you think you could isolate the problem into an MWE?

@lostella
Contributor

Can reproduce with the following:

import mxnet as mx

from gluonts.dataset.repository.datasets import get_dataset
from gluonts.model.simple_feedforward import SimpleFeedForwardEstimator
from gluonts.trainer import Trainer

dataset = get_dataset("electricity", regenerate=False)

estimator = SimpleFeedForwardEstimator(freq="H", prediction_length=24, trainer=Trainer(epochs=3, ctx=mx.gpu()))

predictor = estimator.train(dataset.train, num_workers=2)

which fails on master but succeeds at c3150eb (i.e. prior to #898).

@PascalIversen
Contributor

PascalIversen commented Sep 30, 2020

The source of the error seems to be placing any data on the GPU inside a with mx.gpu(): block.
This reproduces the error:

import mxnet as mx

from functools import partial

from gluonts.dataset.field_names import FieldName
from gluonts.dataset.loader import TrainDataLoader
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.mx.batchify import batchify
from gluonts.transform import ExpectedNumInstanceSampler, InstanceSplitter

with mx.gpu():
    a = mx.nd.array([42])

dataset = get_dataset("electricity", regenerate=False)

transformation = InstanceSplitter(
    target_field=FieldName.TARGET,
    is_pad_field=FieldName.IS_PAD,
    start_field=FieldName.START,
    forecast_start_field=FieldName.FORECAST_START,
    train_sampler=ExpectedNumInstanceSampler(num_instances=2),
    past_length=24,
    future_length=12,
)

batch_size = 32
num_batches_per_epoch = 300

data_loader = TrainDataLoader(
    dataset.train,
    batch_size=batch_size,
    stack_fn=partial(batchify, ctx=mx.gpu()),
    transform=transformation,
    num_batches_per_epoch=num_batches_per_epoch,
    num_workers=2,
)

We see no error after restarting the kernel and commenting out:

with mx.gpu():
    a = mx.nd.array([42])

Thus training like this works fine:

import time
from functools import partial

import mxnet as mx
from mxnet import autograd, gluon, init
from tqdm import tqdm

from gluonts.dataset.field_names import FieldName
from gluonts.dataset.loader import TrainDataLoader
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.mx.batchify import batchify
from gluonts.transform import ExpectedNumInstanceSampler, InstanceSplitter

dataset = get_dataset("electricity", regenerate=False)

transformation = InstanceSplitter(
    target_field=FieldName.TARGET,
    is_pad_field=FieldName.IS_PAD,
    start_field=FieldName.START,
    forecast_start_field=FieldName.FORECAST_START,
    train_sampler=ExpectedNumInstanceSampler(num_instances=2),
    past_length=24,
    future_length=12,
)

batch_size = 32
num_batches_per_epoch = 300

data_loader = TrainDataLoader(
    dataset.train,
    batch_size=batch_size,
    stack_fn=partial(batchify, ctx=mx.gpu()),
    transform=transformation,
    num_batches_per_epoch=num_batches_per_epoch,
    num_workers=2,
)

epochs = 3
num_workers = 4
learning_rate = 0.001
weight_decay = 1e-8


class MyTrainNetwork(gluon.HybridBlock):
    def __init__(self, prediction_length, **kwargs):
        super().__init__(**kwargs)
        self.prediction_length = prediction_length

        with self.name_scope():
            # Set up a 3-layer neural network that directly predicts the target values
            self.nn = mx.gluon.nn.HybridSequential()
            self.nn.add(mx.gluon.nn.Dense(units=40, activation="relu"))
            self.nn.add(mx.gluon.nn.Dense(units=40, activation="relu"))
            self.nn.add(
                mx.gluon.nn.Dense(
                    units=self.prediction_length, activation="softrelu"
                )
            )

    def hybrid_forward(self, F, past_target, future_target):
        prediction = self.nn(past_target)
        # calculate L1 loss with the future_target to learn the median
        return (prediction - future_target).abs().mean(axis=-1)


with mx.gpu():
    net = MyTrainNetwork(prediction_length=12)
    net.initialize(init=init.Xavier())
    trainer = gluon.Trainer(net.collect_params(), "sgd", {"learning_rate": 0.1})

    for epoch_no in range(epochs):
        # mark epoch start time
        tic = time.time()
        avg_epoch_loss = 0.0

        with tqdm(data_loader) as it:
            for batch_no, data_entry in enumerate(it, start=1):
                inputs = mx.nd.array(data_entry["past_target"])
                targets = mx.nd.array(data_entry["future_target"])

                with autograd.record():
                    loss = mx.nd.mean(net(inputs, targets))

                avg_epoch_loss += loss.asnumpy().item()
                it.set_postfix(
                    ordered_dict={
                        "avg_epoch_loss": avg_epoch_loss / batch_no,
                        "epoch": epoch_no,
                    },
                    refresh=False,
                )
                n_iter = epoch_no * num_batches_per_epoch + batch_no

                loss.backward()
                trainer.step(batch_size)

            # mark epoch end time and log time cost of current epoch
            toc = time.time()

Collecting some relevant links:
apache/mxnet#17826
pytorch/pytorch#2517
pytorch/pytorch#21092
apache/mxnet#4659

Update: a minimal failing example (MFE):

import multiprocessing as mp
from multiprocessing import Manager, Process

import mxnet as mx

mp.set_start_method("fork")

# touch the GPU in the parent process before forking the workers
with mx.gpu():
    a = mx.nd.array([42])

num_workers = 3


def worker_fn(worker_id, generator, terminate_event, exhausted_event):
    # touching the GPU again in the forked child triggers the CUDA initialization error
    with mx.gpu():
        b = mx.nd.array([42])

    exhausted_event.set()


manager = Manager()

exhausted_events = [manager.Event() for _ in range(num_workers)]
terminate_event = manager.Event()


def gen():
    for i in range(100):
        yield i * 2


processes = []
for worker_id, event in enumerate(exhausted_events):
    p = Process(
        target=worker_fn,
        args=(worker_id, gen(), terminate_event, event),
    )
    p.start()
    processes.append(p)

@PascalIversen
Contributor

It seems like MXNet does not support the way we currently employ multiprocessing.
See:
apache/mxnet#19291
A way out would be to use mp.set_start_method("spawn"), which, however, does not support passing generators to workers.
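For illustration only (not from the original comment), here is a minimal sketch of that constraint: with the "spawn" start method, arguments handed to worker processes must be picklable, and plain Python generators are not, which is why the current pattern of passing a generator to each worker breaks.

import multiprocessing as mp
import pickle


def gen():
    # stand-in for the data generator handed to each worker
    for i in range(10):
        yield i


if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    try:
        # "spawn" pickles worker arguments; generators cannot be pickled
        pickle.dumps(gen())
    except TypeError as err:
        print(err)  # e.g. "cannot pickle 'generator' object"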

@lostella
Contributor

lostella commented Oct 6, 2020

So the problem is just caused by having something of the form

with mx.gpu():
    a = mx.nd.array([42])

before forking the worker processes. But where is this happening in my example above? There I'm just calling train, which invokes the data loader.

@PascalIversen
Contributor

PascalIversen commented Oct 6, 2020

This happens in the estimator's train_model method:

        # ensure that the training network is created within the same MXNet
        # context as the one that will be used during training
        with self.trainer.ctx:
            trained_net = self.create_training_network()

        self.trainer(
            net=trained_net,
            input_names=get_hybrid_forward_input_names(trained_net),
            train_iter=training_data_loader,
            validation_iter=validation_data_loader,
        )

        with self.trainer.ctx:
            # ensure that the prediction network is created within the same MXNet
            # context as the one that was used during training
            return TrainOutput(
                transformation=transformation,
                trained_net=trained_net,
                predictor=self.create_predictor(transformation, trained_net),
            )

@lostella
Contributor

lostella commented Oct 9, 2020

@szhengac after merging #1080, the problem should be solved if you set the following at the beginning of your jobs:

import multiprocessing
multiprocessing.set_start_method("spawn", force=True)

Could you verify that?
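One usage note (my addition; the surrounding script is just a placeholder): with the "spawn" start method the entry point should be guarded by if __name__ == "__main__":, since spawned workers re-import the main module.

import multiprocessing

if __name__ == "__main__":
    # set the start method before any GPU work and before the data loader
    # creates its worker processes; spawned workers re-import this module
    multiprocessing.set_start_method("spawn", force=True)

    # ... build the estimator and call estimator.train(...) as usual ...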

@szhengac
Contributor Author

@lostella I no longer work on that job. I think Konstantinos will work on it after he finishes his other work.

@lostella
Contributor

@benidis whenever you run jobs on GPU, could you follow up here to confirm that training works fine on master?

@benidis
Contributor

benidis commented Nov 6, 2020

Confirming that this fix works.

@lostella
Contributor

lostella commented Nov 6, 2020

I'll leave this open since, for some reason, prior to #898 this fix was not required, which I cannot really explain.

@lostella
Contributor

Closing this since the multiprocessing data loader was removed in #2018.
