#898 causes cuda initialization error #1054
Comments
@szhengac do you think you could isolate the problem into an MWE?
Can reproduce with the following:

```python
import mxnet as mx
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.model.simple_feedforward import SimpleFeedForwardEstimator
from gluonts.trainer import Trainer

dataset = get_dataset("electricity", regenerate=False)
estimator = SimpleFeedForwardEstimator(
    freq="H",
    prediction_length=24,
    trainer=Trainer(epochs=3, ctx=mx.gpu()),
)
predictor = estimator.train(dataset.train, num_workers=2)
```

which fails on master but succeeds in c3150eb (i.e. prior to #898).
The source of the error seems to be placing any data on the GPU at the start of the script, before the data loader is created:

```python
import mxnet as mx
from functools import partial
from gluonts.dataset.field_names import FieldName
from gluonts.dataset.loader import TrainDataLoader
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.mx.batchify import batchify
from gluonts.transform import (
    InstanceSplitter,
    ExpectedNumInstanceSampler,
)

# Placing data on the GPU here, before the loader forks its workers,
# triggers the CUDA initialization error.
with mx.gpu():
    a = mx.nd.array([42])

dataset = get_dataset("electricity", regenerate=False)

transformation = InstanceSplitter(
    target_field=FieldName.TARGET,
    is_pad_field=FieldName.IS_PAD,
    start_field=FieldName.START,
    forecast_start_field=FieldName.FORECAST_START,
    train_sampler=ExpectedNumInstanceSampler(num_instances=2),
    past_length=24,
    future_length=12,
)

batch_size = 32
num_batches_per_epoch = 300

data_loader = TrainDataLoader(
    dataset.train,
    batch_size=batch_size,
    stack_fn=partial(batchify, ctx=mx.gpu()),
    transform=transformation,
    num_batches_per_epoch=num_batches_per_epoch,
    num_workers=2,
)
```

We see no error when restarting the kernel and commenting out:

```python
with mx.gpu():
    a = mx.nd.array([42])
```

Thus training like this works fine:
```python
import time

import mxnet as mx
from mxnet import autograd, gluon, init
from functools import partial
from tqdm import tqdm
from gluonts.dataset.field_names import FieldName
from gluonts.dataset.loader import TrainDataLoader
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.mx.batchify import batchify
from gluonts.transform import (
    InstanceSplitter,
    ExpectedNumInstanceSampler,
)

dataset = get_dataset("electricity", regenerate=False)

transformation = InstanceSplitter(
    target_field=FieldName.TARGET,
    is_pad_field=FieldName.IS_PAD,
    start_field=FieldName.START,
    forecast_start_field=FieldName.FORECAST_START,
    train_sampler=ExpectedNumInstanceSampler(num_instances=2),
    past_length=24,
    future_length=12,
)

batch_size = 32
num_batches_per_epoch = 300

data_loader = TrainDataLoader(
    dataset.train,
    batch_size=batch_size,
    stack_fn=partial(batchify, ctx=mx.gpu()),
    transform=transformation,
    num_batches_per_epoch=num_batches_per_epoch,
    num_workers=2,
)

epochs = 3
num_workers = 4
learning_rate = 0.001
weight_decay = 1e-8


class MyTrainNetwork(gluon.HybridBlock):
    def __init__(self, prediction_length, **kwargs):
        super().__init__(**kwargs)
        self.prediction_length = prediction_length
        with self.name_scope():
            # Set up a 3-layer neural network that directly predicts the target values
            self.nn = mx.gluon.nn.HybridSequential()
            self.nn.add(mx.gluon.nn.Dense(units=40, activation="relu"))
            self.nn.add(mx.gluon.nn.Dense(units=40, activation="relu"))
            self.nn.add(
                mx.gluon.nn.Dense(
                    units=self.prediction_length, activation="softrelu"
                )
            )

    def hybrid_forward(self, F, past_target, future_target):
        prediction = self.nn(past_target)
        # calculate L1 loss with the future_target to learn the median
        return (prediction - future_target).abs().mean(axis=-1)


with mx.gpu():
    net = MyTrainNetwork(prediction_length=12)
    net.initialize(init=init.Xavier())
    trainer = gluon.Trainer(net.collect_params(), "sgd", {"learning_rate": 0.1})

    for epoch_no in range(epochs):
        # mark epoch start time
        tic = time.time()
        avg_epoch_loss = 0.0
        with tqdm(data_loader) as it:
            for batch_no, data_entry in enumerate(it, start=1):
                inputs = mx.nd.array(data_entry["past_target"])
                targets = mx.nd.array(data_entry["future_target"])
                with autograd.record():
                    loss = mx.nd.mean(net(inputs, targets))
                avg_epoch_loss += loss.asnumpy().item()
                it.set_postfix(
                    ordered_dict={
                        "avg_epoch_loss": avg_epoch_loss / batch_no,
                        "epoch": epoch_no,
                    },
                    refresh=False,
                )
                n_iter = epoch_no * num_batches_per_epoch + batch_no
                loss.backward()
                trainer.step(batch_size)
        # mark epoch end time and log time cost of current epoch
        toc = time.time()
```

Collecting some relevant links:

Update: MFE:
```python
import mxnet as mx
import multiprocessing as mp
from multiprocessing import Process, Manager, Queue

mp.set_start_method("fork")

with mx.gpu():
    a = mx.nd.array([42])

num_workers = 3


def worker_fn(worker_id, generator, terminate_event, exhausted_event):
    with mx.gpu():
        b = mx.nd.array([42])
    exhausted_event.set()


manager = Manager()
exhausted_events = [manager.Event() for _ in range(num_workers)]
terminate_event = manager.Event()


def gen():
    for i in range(100):
        yield i * 2


processes = []
for worker_id, event in enumerate(exhausted_events):
    p = Process(
        target=worker_fn,
        args=(worker_id, gen(), terminate_event, event),
    )
    p.start()
    processes.append(p)
```
It seems like
So the problem is just caused by having something of the form `with mx.gpu(): a = mx.nd.array(...)` before forking the worker processes. But where is this happening in my example above? There I'm just calling `estimator.train(...)`.
This happens in the estimator's training code:

```python
# ensure that the training network is created within the same MXNet
# context as the one that will be used during training
with self.trainer.ctx:
    trained_net = self.create_training_network()

self.trainer(
    net=trained_net,
    input_names=get_hybrid_forward_input_names(trained_net),
    train_iter=training_data_loader,
    validation_iter=validation_data_loader,
)

with self.trainer.ctx:
    # ensure that the prediction network is created within the same MXNet
    # context as the one that was used during training
    return TrainOutput(
        transformation=transformation,
        trained_net=trained_net,
        predictor=self.create_predictor(transformation, trained_net),
    )
```
@lostella I no longer work on that job. I think Konstantinos will work on it after he finishes his other work.
@benidis whenever you run jobs on GPU, could you follow up here to confirm that training works fine on master?
Confirming that this fix works.
I'll leave this open since, for some reason, this fix was not required prior to #898, which I cannot really explain.
Closing this, since the multiprocessing data loader was removed in #2018.
When I submitted a SageMaker job with the latest gluonts master, I received a CUDA initialization error in https://github.com/awslabs/gluon-ts/blob/3ff95ebcf70e39f6424d76b9b58bd861b36f7363/src/gluonts/mx/trainer/_base.py#L242. After reverting commit #898, everything works fine.