#898 causes cuda initialization error #1054

Closed
szhengac opened this issue Sep 25, 2020 · 12 comments
Labels
bug Something isn't working

Comments

@szhengac
Contributor

szhengac commented Sep 25, 2020

When I submitted a SageMaker job with the latest gluonts master, I received a CUDA initialization error at https://github.com/awslabs/gluon-ts/blob/3ff95ebcf70e39f6424d76b9b58bd861b36f7363/src/gluonts/mx/trainer/_base.py#L242. After reverting #898, everything works fine.

@szhengac added the bug label on Sep 25, 2020
@lostella changed the title from "#898 causes cuda initialization error in internal repo" to "#898 causes cuda initialization error" on Sep 25, 2020
@lostella
Contributor

@szhengac do you think you could isolate the problem into an MWE?

@lostella
Contributor

Can reproduce with the following:

import mxnet as mx

from gluonts.dataset.repository.datasets import get_dataset
from gluonts.model.simple_feedforward import SimpleFeedForwardEstimator
from gluonts.trainer import Trainer

dataset = get_dataset("electricity", regenerate=False)

estimator = SimpleFeedForwardEstimator(freq="H", prediction_length=24, trainer=Trainer(epochs=3, ctx=mx.gpu()))

predictor = estimator.train(dataset.train, num_workers=2)

which fails on master but succeeds at c3150eb (i.e. prior to #898).

@PascalIversen
Contributor

PascalIversen commented Sep 30, 2020

The source of the error seems to be placing any data on the GPU inside a with mx.gpu(): block.
This reproduces the error:

import mxnet as mx

from functools import partial

from gluonts.dataset.field_names import FieldName
from gluonts.dataset.loader import TrainDataLoader
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.mx.batchify import batchify
from gluonts.transform import ExpectedNumInstanceSampler, InstanceSplitter

with mx.gpu():
    a = mx.nd.array([42])

dataset = get_dataset("electricity", regenerate=False)

transformation = InstanceSplitter(
    target_field=FieldName.TARGET,
    is_pad_field=FieldName.IS_PAD,
    start_field=FieldName.START,
    forecast_start_field=FieldName.FORECAST_START,
    train_sampler=ExpectedNumInstanceSampler(num_instances=2),
    past_length=24,
    future_length=12,
)

batch_size = 32
num_batches_per_epoch = 300

data_loader = TrainDataLoader(
    dataset.train,
    batch_size=batch_size,
    stack_fn=partial(batchify, ctx=mx.gpu()),
    transform=transformation,
    num_batches_per_epoch=num_batches_per_epoch,
    num_workers=2,
)

We see no error after restarting the kernel and commenting out:

with mx.gpu():
    a = mx.nd.array([42])

Thus training like this works fine:

import time
from functools import partial

import mxnet as mx
from mxnet import autograd, gluon, init
from tqdm import tqdm

from gluonts.dataset.field_names import FieldName
from gluonts.dataset.loader import TrainDataLoader
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.mx.batchify import batchify
from gluonts.transform import ExpectedNumInstanceSampler, InstanceSplitter

dataset = get_dataset("electricity", regenerate=False)

transformation = InstanceSplitter(
    target_field=FieldName.TARGET,
    is_pad_field=FieldName.IS_PAD,
    start_field=FieldName.START,
    forecast_start_field=FieldName.FORECAST_START,
    train_sampler=ExpectedNumInstanceSampler(num_instances=2),
    past_length=24,
    future_length=12,
)

batch_size = 32
num_batches_per_epoch = 300

data_loader = TrainDataLoader(
    dataset.train,
    batch_size=batch_size,
    stack_fn=partial(batchify, ctx=mx.gpu()),
    transform=transformation,
    num_batches_per_epoch=num_batches_per_epoch,
    num_workers=2,
)

epochs = 3
num_workers = 4
learning_rate = 0.001
weight_decay = 1e-8


class MyTrainNetwork(gluon.HybridBlock):
    def __init__(self, prediction_length, **kwargs):
        super().__init__(**kwargs)
        self.prediction_length = prediction_length

        with self.name_scope():
            # Set up a 3-layer neural network that directly predicts the target values
            self.nn = mx.gluon.nn.HybridSequential()
            self.nn.add(mx.gluon.nn.Dense(units=40, activation="relu"))
            self.nn.add(mx.gluon.nn.Dense(units=40, activation="relu"))
            self.nn.add(
                mx.gluon.nn.Dense(
                    units=self.prediction_length, activation="softrelu"
                )
            )

    def hybrid_forward(self, F, past_target, future_target):
        prediction = self.nn(past_target)
        # calculate L1 loss with the future_target to learn the median
        return (prediction - future_target).abs().mean(axis=-1)


with mx.gpu():
    net = MyTrainNetwork(prediction_length=12)
    net.initialize(init=init.Xavier())
    trainer = gluon.Trainer(net.collect_params(), "sgd", {"learning_rate": 0.1})

    for epoch_no in range(epochs):
        # mark epoch start time
        tic = time.time()
        avg_epoch_loss = 0.0

        with tqdm(data_loader) as it:
            for batch_no, data_entry in enumerate(it, start=1):
                inputs = mx.nd.array(data_entry["past_target"])
                targets = mx.nd.array(data_entry["future_target"])

                with autograd.record():
                    loss = mx.nd.mean(net(inputs, targets))

                avg_epoch_loss += loss.asnumpy().item()
                it.set_postfix(
                    ordered_dict={
                        "avg_epoch_loss": avg_epoch_loss / batch_no,
                        "epoch": epoch_no,
                    },
                    refresh=False,
                )
                n_iter = epoch_no * num_batches_per_epoch + batch_no

                loss.backward()
                trainer.step(batch_size)

            # mark epoch end time and log time cost of current epoch
            toc = time.time()

Collecting some relevant links:
apache/mxnet#17826
pytorch/pytorch#2517
pytorch/pytorch#21092
apache/mxnet#4659

Update: a minimal failing example (MFE):

import multiprocessing as mp
from multiprocessing import Manager, Process

import mxnet as mx

mp.set_start_method("fork")

# touch the GPU in the parent process before forking the workers
with mx.gpu():
    a = mx.nd.array([42])

num_workers = 3


def worker_fn(worker_id, generator, terminate_event, exhausted_event):
    # touching the GPU again in the forked child triggers the CUDA initialization error
    with mx.gpu():
        b = mx.nd.array([42])

    exhausted_event.set()


manager = Manager()

exhausted_events = [manager.Event() for _ in range(num_workers)]
terminate_event = manager.Event()


def gen():
    for i in range(100):
        yield i * 2


processes = []
for worker_id, event in enumerate(exhausted_events):
    p = Process(
        target=worker_fn,
        args=(worker_id, gen(), terminate_event, event),
    )
    p.start()
    processes.append(p)

@PascalIversen
Contributor

It seems like MXNet does not support the way we currently employ multiprocessing.
See:
apache/mxnet#19291
A way out would be to use mp.set_start_method("spawn"), which, however, does not support passing generators to workers.
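For illustration only (not from the original comment), here is a minimal sketch of that constraint: with the "spawn" start method, arguments handed to worker processes must be picklable, and plain Python generators are not, which is why the current pattern of passing a generator to each worker breaks.

import multiprocessing as mp
import pickle


def gen():
    # stand-in for the data generator handed to each worker
    for i in range(10):
        yield i


if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    try:
        # "spawn" pickles worker arguments; generators cannot be pickled
        pickle.dumps(gen())
    except TypeError as err:
        print(err)  # e.g. "cannot pickle 'generator' object"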

@lostella
Contributor

lostella commented Oct 6, 2020

So the problem is just caused by having something of the form

with mx.gpu():
    a = mx.nd.array([42])

before forking the worker processes. But where is this happening in my example above? There I'm just calling train, which invokes the data loader.

@PascalIversen
Contributor

PascalIversen commented Oct 6, 2020

This happens in the estimator's train_model method:

        # ensure that the training network is created within the same MXNet
        # context as the one that will be used during training
        with self.trainer.ctx:
            trained_net = self.create_training_network()

        self.trainer(
            net=trained_net,
            input_names=get_hybrid_forward_input_names(trained_net),
            train_iter=training_data_loader,
            validation_iter=validation_data_loader,
        )

        with self.trainer.ctx:
            # ensure that the prediction network is created within the same MXNet
            # context as the one that was used during training
            return TrainOutput(
                transformation=transformation,
                trained_net=trained_net,
                predictor=self.create_predictor(transformation, trained_net),
            )

@lostella
Contributor

lostella commented Oct 9, 2020

@szhengac after merging #1080, the problem should be solved if you set the following at the beginning of your jobs:

import multiprocessing
multiprocessing.set_start_method("spawn", force=True)

Could you verify that?
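One usage note (my addition; the surrounding script is just a placeholder): with the "spawn" start method the entry point should be guarded by if __name__ == "__main__":, since spawned workers re-import the main module.

import multiprocessing

if __name__ == "__main__":
    # set the start method before any GPU work and before the data loader
    # creates its worker processes; spawned workers re-import this module
    multiprocessing.set_start_method("spawn", force=True)

    # ... build the estimator and call estimator.train(...) as usual ...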

@szhengac
Contributor Author

@lostella I no longer work on that job. I think Konstantinos will work on it after he finishes his other work.

@lostella
Contributor

@benidis whenever you run jobs on GPU, could you follow up here to confirm that training works fine on master?

@benidis
Contributor

benidis commented Nov 6, 2020

Confirming that this fix works.

@lostella
Contributor

lostella commented Nov 6, 2020

I'll leave this open since, for some reason, prior to #898 this fix was not required, which I cannot really explain.

@lostella
Contributor

Closing this since the multiprocessing data loader was removed in #2018.
