Multi-GPU training errors #11

Closed
MichalMalyska opened this issue Oct 14, 2020 · 5 comments · Fixed by #12
Labels: bug (Something isn't working)

Comments

MichalMalyska commented Oct 14, 2020

Every multi-GPU run I have tried so far results in:

wandb.errors.error.Error: You must call wandb.init() before wandb.config.update

I think this is due to each process calling the update_config method, which tries to send the config to wandb even when self.config is not needed.

I thought a possible solution would be to pass the is_master argument from __call__ to update_config

def update_config(self, trainer: GradientDescentTrainer) -> None:

and only log the config to wandb if is_master.
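
A minimal sketch of what I have in mind (the surrounding signatures are my assumptions about log_to_wandb.py, not the actual code):

def __call__(self, trainer, metrics, epoch, is_master=False, **kwargs) -> None:
    # ... existing logging logic ...
    self.update_config(trainer, is_master=is_master)

def update_config(self, trainer, is_master: bool = False) -> None:
    # Only the master process has called wandb.init(), so only it should
    # touch wandb.config; every other process skips the update.
    if is_master:
        self.wandb.config.update(self.config)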

Full Trace:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.7/site-packages/allennlp/commands/train.py", line 443, in _train_worker
    metrics = train_loop.run()
  File "/opt/conda/lib/python3.7/site-packages/allennlp/commands/train.py", line 505, in run
    return self.trainer.train()
  File "/opt/conda/lib/python3.7/site-packages/allennlp/training/trainer.py", line 863, in train
    callback(self, metrics={}, epoch=-1, is_master=self._master)
  File "/opt/conda/lib/python3.7/site-packages/wandb_allennlp/training/callbacks/log_to_wandb.py", line 63, in __call__
    self.update_config(trainer)
  File "/opt/conda/lib/python3.7/site-packages/wandb_allennlp/training/callbacks/log_to_wandb.py", line 48, in update_config
    self.wandb.config.update(self.config)
  File "/opt/conda/lib/python3.7/site-packages/wandb/lib/preinit.py", line 29, in __getattr__
    "You must call wandb.init() before {}.{}".format(self._name, key)
wandb.errors.error.Error: You must call wandb.init() before wandb.config.update

@MichalMalyska
Author

Hey @dhruvdcoder, I'm trying to run the multi-GPU setup again and I am running into the exact same error on both versions 0.2.1 and 0.2.2:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.7/site-packages/allennlp/commands/train.py", line 466, in _train_worker
    metrics = train_loop.run()
  File "/opt/conda/lib/python3.7/site-packages/allennlp/commands/train.py", line 528, in run
    return self.trainer.train()
  File "/opt/conda/lib/python3.7/site-packages/allennlp/training/trainer.py", line 966, in train
    return self._try_train()
  File "/opt/conda/lib/python3.7/site-packages/allennlp/training/trainer.py", line 997, in _try_train
    callback(self, metrics={}, epoch=-1, is_master=self._master)
  File "/opt/conda/lib/python3.7/site-packages/wandb_allennlp/training/callbacks/log_to_wandb.py", line 63, in __call__
    self.update_config(trainer)
  File "/opt/conda/lib/python3.7/site-packages/wandb_allennlp/training/callbacks/log_to_wandb.py", line 48, in update_config
    self.wandb.config.update(self.config)
  File "/opt/conda/lib/python3.7/site-packages/wandb/lib/preinit.py", line 29, in __getattr__
    "You must call wandb.init() before {}.{}".format(self._name, key)
wandb.errors.error.Error: You must call wandb.init() before wandb.config.update

@dhruvdcoder
Owner

@MichalMalyska The code does use is_master to make sure that it logs only on the main process. Do you see any other issue that could cause this error? Unfortunately, I won't be able to look into this until next week. After that, I will hopefully also have time to write some tests that exercise distributed training (albeit on CPU, as GitHub does not provide GPUs).

@MichalMalyska
Author

@dhruvdcoder I have looked it over and I can't see why that would happen. Is it possible that wandb.init is being called in a process other than the master?

@masashi-y

Hi, just to report: I ran into a similar error using multiple GPUs. Mine was AttributeError: 'function' object has no attribute 'update', which occurred here, using wandb==0.10.12.

This suggests that wandb is not initialized within the master process, even though that initialization does happen somewhere else (it is only after initialization that wandb.config turns from a function into a dict-like object).
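
For what it's worth, a defensive guard consistent with this diagnosis would be to check wandb.run, which stays None until wandb.init() has been called in the current process (the helper name below is hypothetical):

import wandb

def update_config_safely(config):
    # wandb.run is None until wandb.init() has run in this process, e.g. in a
    # spawned non-master worker, where wandb.config is still the pre-init stub.
    if wandb.run is None:
        return
    wandb.config.update(config)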

@dhruvdcoder
Owner

@MichalMalyska @masashi-y Thanks for your patience. I think I figured out the issue. I was under the impression that allennlp used fork to create new processes for distributed training. So the way I structured the code was:

Init wandb -> let the allennlp train command set up new processes -> let the wandb callbacks do their work.

However, this has multiple issues:

  1. It will only work if fork is used to create new processes. However, allennlp uses spawn, hence the wandb object created through wandb.init in the main process needs to be re-created after the call to spawn (see: https://github.com/allenai/allennlp/blob/master/allennlp/commands/train.py#L406).
  2. Even if fork were used, it would not work correctly for multi-node training, where you explicitly launch training on different nodes and supply the correct node rank to each launch.

We cannot inject code to re-init wandb right after spawn is called. However, we could specialize the "from_partial_objects" method of the TrainModel class to initialize wandb only on the global master process using torch.distributed.get_rank. I think this will work in all distributed settings.
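
Something along these lines (not tested yet; the class name, registration key, and wandb.init arguments below are placeholders, not the final implementation):

import torch.distributed as dist
import wandb
from allennlp.commands.train import TrainModel

@TrainModel.register("wandb_train_model", constructor="from_partial_objects")
class WandbTrainModel(TrainModel):
    @classmethod
    def from_partial_objects(cls, *args, **kwargs):
        # Each spawned worker builds its own TrainModel, so this runs once per
        # process; only the global master (rank 0) initializes wandb.
        not_distributed = not dist.is_available() or not dist.is_initialized()
        if not_distributed or dist.get_rank() == 0:
            wandb.init(project="placeholder-project")  # placeholder arguments
        return super().from_partial_objects(*args, **kwargs)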
