Multi-GPU training errors #11
Hey @dhruvdcoder, I'm trying to run the multi-GPU training again and I am running into the exact same error on both the 0.2.1 and 0.2.2 versions: `-- Process 0 terminated with the following error:`
@MichalMalyska The code does use `is_master` to make sure that it logs only on the main process. Do you see any other issue which could cause this error? Unfortunately, I won't be able to look into this until next week. Hopefully, after next week I will also have time to write some tests which use distributed training (albeit on CPU, as GitHub does not provide GPUs).
@dhruvdcoder I have looked it over and I can't see why that could be the case. Is it possible that `wandb.init` is not getting called in the master process?
Hi, this is just to report, but I faced a similar error using multiple GPUs. Mine also indicates that the cause is that wandb is not initialized within the master process, even though the initialization happens somewhere else (it is only after the initialization …)
@MichalMalyska @masashi-y Thanks for the patience. I think I figured out the issue. I was under the impression that allennlp used fork to create new processes when performing distributed training, but it actually uses spawn (see `torch/multiprocessing/spawn.py` in the trace below), so the spawned workers do not inherit the state of the parent process. The way I structured the code was: init wandb -> let the allennlp train command set up new processes -> let the wandb callbacks do their work. However, this has multiple issues:
We cannot inject code to re-init wandb right after spawn is called. However, we could specialize the `from_partial_objects` method of the …
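For illustration, here is a minimal sketch (not the project's code; `INITIALIZED` is a hypothetical stand-in for the wandb run created by `wandb.init()`) of why state set up before `torch.multiprocessing.spawn` is invisible to the workers: spawn starts fresh interpreter processes rather than forking the parent.

```python
import torch.multiprocessing as mp

INITIALIZED = False  # stand-in for "wandb.init() has been called"

def worker(rank):
    # Each spawned worker re-imports this module in a fresh interpreter,
    # so it sees the module-level default, not the parent's assignment.
    print(f"rank {rank}: INITIALIZED={INITIALIZED}")  # prints False

if __name__ == "__main__":
    INITIALIZED = True  # "wandb.init()" happens here, in the parent only
    mp.spawn(worker, nprocs=2)
```

With fork the workers would have inherited `INITIALIZED = True`; with spawn they do not, which matches the behaviour described above.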
Every multi-GPU run I have tried so far results in:

```
wandb.errors.error.Error: You must call wandb.init() before wandb.config.update
```
I think this is due to each process calling the `update_config` method, which tries to send `self.config` to wandb even in the worker processes where it is not needed.
I thought a possible solution would be to pass the `is_master` argument from `__call__` to `update_config` (`wandb-allennlp/wandb_allennlp/training/callbacks/log_to_wandb.py`, line 41 in `9e6ba7f`) and only log the config to wandb if `is_master`.
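A minimal sketch of that proposal, using the method and attribute names visible in the trace below (the real callback in `log_to_wandb.py` carries more state; this is abbreviated for illustration):

```python
import wandb

class LogToWandb:
    """Abbreviated sketch of the callback; not the actual implementation."""

    def __init__(self, config: dict):
        self.config = config
        self.wandb = wandb

    def update_config(self, trainer, is_master: bool) -> None:
        # Guard: only the master process has called wandb.init(), so only
        # it can safely touch wandb.config.
        if is_master:
            self.wandb.config.update(self.config)

    def __call__(self, trainer, metrics, epoch, is_master: bool) -> None:
        if epoch == -1:  # fired once before training, as in the trace below
            self.update_config(trainer, is_master=is_master)
```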
Full Trace:
```
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.7/site-packages/allennlp/commands/train.py", line 443, in _train_worker
    metrics = train_loop.run()
  File "/opt/conda/lib/python3.7/site-packages/allennlp/commands/train.py", line 505, in run
    return self.trainer.train()
  File "/opt/conda/lib/python3.7/site-packages/allennlp/training/trainer.py", line 863, in train
    callback(self, metrics={}, epoch=-1, is_master=self._master)
  File "/opt/conda/lib/python3.7/site-packages/wandb_allennlp/training/callbacks/log_to_wandb.py", line 63, in __call__
    self.update_config(trainer)
  File "/opt/conda/lib/python3.7/site-packages/wandb_allennlp/training/callbacks/log_to_wandb.py", line 48, in update_config
    self.wandb.config.update(self.config)
  File "/opt/conda/lib/python3.7/site-packages/wandb/lib/preinit.py", line 29, in __getattr__
    "You must call wandb.init() before {}.{}".format(self._name, key)
wandb.errors.error.Error: You must call wandb.init() before wandb.config.update
```
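For reference, the failure is easy to reproduce outside allennlp; a hedged sketch (the exact exception class may vary across wandb versions):

```python
import wandb

# No wandb.init() has been called in this process, mimicking a spawned
# worker. Touching wandb.config therefore hits the pre-init guard.
try:
    wandb.config.update({"lr": 1e-3})
except Exception as err:
    print(type(err).__name__, err)
    # e.g. "You must call wandb.init() before wandb.config.update"
```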