Multi-GPU training errors #11

Closed
MichalMalyska opened this issue Oct 14, 2020 · 5 comments · Fixed by #12
Labels: bug (Something isn't working)

Comments

MichalMalyska commented Oct 14, 2020

Every multi-GPU run I have tried so far results in:

wandb.errors.error.Error: You must call wandb.init() before wandb.config.update

I think this is due to each process calling the update_config method, which tries to send the config to wandb even when self.config is not needed.

I thought a possible solution would be to pass the is_master argument from __call__ to update_config

def update_config(self, trainer: GradientDescentTrainer) -> None:

and only log the config to wandb if is_master.
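
A minimal sketch of what I have in mind (the surrounding signatures are my assumptions about log_to_wandb.py, not the actual code):

def __call__(self, trainer, metrics, epoch, is_master=False, **kwargs) -> None:
    # ... existing logging logic ...
    self.update_config(trainer, is_master=is_master)

def update_config(self, trainer, is_master: bool = False) -> None:
    # Only the master process has called wandb.init(), so only it should
    # touch wandb.config; every other process skips the update.
    if is_master:
        self.wandb.config.update(self.config)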

Full Trace:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.7/site-packages/allennlp/commands/train.py", line 443, in _train_worker
    metrics = train_loop.run()
  File "/opt/conda/lib/python3.7/site-packages/allennlp/commands/train.py", line 505, in run
    return self.trainer.train()
  File "/opt/conda/lib/python3.7/site-packages/allennlp/training/trainer.py", line 863, in train
    callback(self, metrics={}, epoch=-1, is_master=self._master)
  File "/opt/conda/lib/python3.7/site-packages/wandb_allennlp/training/callbacks/log_to_wandb.py", line 63, in __call__
    self.update_config(trainer)
  File "/opt/conda/lib/python3.7/site-packages/wandb_allennlp/training/callbacks/log_to_wandb.py", line 48, in update_config
    self.wandb.config.update(self.config)
  File "/opt/conda/lib/python3.7/site-packages/wandb/lib/preinit.py", line 29, in __getattr__
    "You must call wandb.init() before {}.{}".format(self._name, key)
wandb.errors.error.Error: You must call wandb.init() before wandb.config.update

@MichalMalyska
Author

Hey @dhruvdcoder, I'm trying to run the multi-GPU setup again and I am running into the exact same error on both versions 0.2.1 and 0.2.2:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.7/site-packages/allennlp/commands/train.py", line 466, in _train_worker
    metrics = train_loop.run()
  File "/opt/conda/lib/python3.7/site-packages/allennlp/commands/train.py", line 528, in run
    return self.trainer.train()
  File "/opt/conda/lib/python3.7/site-packages/allennlp/training/trainer.py", line 966, in train
    return self._try_train()
  File "/opt/conda/lib/python3.7/site-packages/allennlp/training/trainer.py", line 997, in _try_train
    callback(self, metrics={}, epoch=-1, is_master=self._master)
  File "/opt/conda/lib/python3.7/site-packages/wandb_allennlp/training/callbacks/log_to_wandb.py", line 63, in __call__
    self.update_config(trainer)
  File "/opt/conda/lib/python3.7/site-packages/wandb_allennlp/training/callbacks/log_to_wandb.py", line 48, in update_config
    self.wandb.config.update(self.config)
  File "/opt/conda/lib/python3.7/site-packages/wandb/lib/preinit.py", line 29, in __getattr__
    "You must call wandb.init() before {}.{}".format(self._name, key)
wandb.errors.error.Error: You must call wandb.init() before wandb.config.update

@dhruvdcoder
Owner

@MichalMalyska The code does use is_master to make sure that it logs only on the main process. Do you see any other issue that could cause this error? Unfortunately, I won't be able to look into this until next week. After that, I will hopefully also have time to write some tests that exercise distributed training (albeit on CPU, as GitHub does not provide GPUs).

@MichalMalyska
Author

@dhruvdcoder I have looked it over and I can't see why that would happen. Is it possible that wandb.init is being called in a process other than the master?

@masashi-y

Hi, just to report: I ran into a similar error using multiple GPUs. Mine was AttributeError: 'function' object has no attribute 'update', which occurred here, using wandb==0.10.12.

This suggests that wandb is not initialized within the master process, even though that initialization does happen somewhere else (it is only after initialization that wandb.config turns from a function into a dict-like object).
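
For what it's worth, a defensive guard consistent with this diagnosis would be to check wandb.run, which stays None until wandb.init() has been called in the current process (the helper name below is hypothetical):

import wandb

def update_config_safely(config):
    # wandb.run is None until wandb.init() has run in this process, e.g. in a
    # spawned non-master worker, where wandb.config is still the pre-init stub.
    if wandb.run is None:
        return
    wandb.config.update(config)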

@dhruvdcoder
Owner

@MichalMalyska @masashi-y Thanks for your patience. I think I figured out the issue. I was under the impression that allennlp used fork to create new processes for distributed training. So the way I structured the code was:

Init wandb -> let the allennlp train command set up new processes -> let the wandb callbacks do their work.

However, this has multiple issues:

  1. It will only work if fork is used to create new processes. However, allennlp uses spawn, hence the wandb object created through wandb.init in the main process needs to be re-created after the call to spawn (see: https://github.com/allenai/allennlp/blob/master/allennlp/commands/train.py#L406).
  2. Even if fork were used, it would not work correctly for multi-node training, where you explicitly launch training on different nodes and supply the correct node rank to each launch.

We cannot inject code to re-init wandb right after spawn is called. However, we could specialize the "from_partial_objects" method of the TrainModel class to initialize wandb only on the global master process using torch.distributed.get_rank. I think this will work in all distributed settings.
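
Something along these lines (not tested yet; the class name, registration key, and wandb.init arguments below are placeholders, not the final implementation):

import torch.distributed as dist
import wandb
from allennlp.commands.train import TrainModel

@TrainModel.register("wandb_train_model", constructor="from_partial_objects")
class WandbTrainModel(TrainModel):
    @classmethod
    def from_partial_objects(cls, *args, **kwargs):
        # Each spawned worker builds its own TrainModel, so this runs once per
        # process; only the global master (rank 0) initializes wandb.
        not_distributed = not dist.is_available() or not dist.is_initialized()
        if not_distributed or dist.get_rank() == 0:
            wandb.init(project="placeholder-project")  # placeholder arguments
        return super().from_partial_objects(*args, **kwargs)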
