[Feature Request] hydra with torch.distributed.launch #2038
Comments
The issue is that …
The problem extends much deeper than a conflict in the argument space.
Thanks for the insight @jbaczek. That is a good point about being unable to use a custom launcher without also using a custom sweeper and multirun mode. A feature request regarding support for custom launching via single-run mode would be most welcome.
I hit another wall with calling torch distributed as a launcher. The launcher is called inside Hydra's main wrapper. I wanted to pass the original function to the multiprocessing context, which pickles the function to call together with its arguments (btw, see Line 22 in 796fdf1, where `Popen` is used). But here comes the issue: pickle serializes only metadata about the function and then restores it by looking the name up again. By that point Hydra's main wrapper has already replaced the main function, so the pickled function no longer matches the function pickle tries to restore. Here is a repro:
```python
import functools
import pickle

def decorate(f):
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        print(pickle.dumps(f))
        return f(*args, **kwargs)
    return wrapper

@decorate
def main():
    print('A')

if __name__ == '__main__':
    main()
```
I'm still looking into the issue. Any hints on how to solve this are welcome.
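To make the failure mode explicit: pickle stores a module-level function as a reference (`<module>.<qualname>`) and, while dumping, verifies that this name still resolves to the object being pickled. After decoration it resolves to the wrapper instead, so dumping the original function raises. A minimal sketch (names are illustrative, not Hydra's internals):

```python
import functools
import pickle

def decorate(f):
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        return f(*args, **kwargs)
    return wrapper

@decorate
def main():
    return "A"

# After decoration, the module attribute `main` is the wrapper; the original
# function survives only as `main.__wrapped__` (set by functools.wraps).
inner = main.__wrapped__

# Pickling the original function fails: its qualified name now resolves to
# the wrapper, which is a different object.
try:
    pickle.dumps(inner)
except pickle.PicklingError as e:
    print(f"PicklingError: {e}")
```

This is the same mismatch the repro above prints from inside the wrapper, just isolated to a single `pickle.dumps` call.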
You can use `--use_env` to shield …
@Jasha10 I have implemented a working plugin for this. Would you be interested in an MR?
@jbaczek I'd certainly be interested to take a look at the code you've written. I can't promise that we'd merge a PR (I'd have to discuss it with the Hydra team).
Hi, I was wondering if this plugin is available now, since I cannot find any docs on the Hydra website.
Hi @zhaoedf
Since contrib is not available on PyPI, does that mean I need to download the code from the link you provided and manually install it into my conda env?
@zhaoedf that is correct. |
For anyone who might be interested: you can use `--use_env` to pass the local rank without causing a conflict.
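With `--use_env`, `torch.distributed.launch` exports the rank through the `LOCAL_RANK` environment variable instead of appending `--local_rank` to the command line, so a Hydra app can read it without any argument conflict. A minimal sketch:

```python
import os

def get_local_rank() -> int:
    """Read the local rank set by torch.distributed.launch --use_env.

    When the script is run directly (no launcher), default to rank 0.
    """
    return int(os.environ.get("LOCAL_RANK", "0"))

local_rank = get_local_rank()
print(f"running as local rank {local_rank}")
```

Since nothing is added to `sys.argv`, Hydra's override grammar never sees the rank at all.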
For anyone who is struggling, simply use …
I've created an updated version of the hydra-torchrun-launcher plugin at https://github.com/acherstyx/hydra-torchrun-launcher. I've resolved the PicklingError by utilizing cloudpickle. Now, this version of the launcher should work exactly like launching with …
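cloudpickle sidesteps the name-lookup problem because, when a function's qualified name no longer resolves to the same object (exactly the situation Hydra's decorator creates), it falls back to serializing the function's code by value. A rough sketch of the difference, assuming cloudpickle is installed (this is an illustration, not the plugin's actual code):

```python
import functools
import cloudpickle

def decorate(f):
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        return f(*args, **kwargs)
    return wrapper

@decorate
def task():
    return "ran"

# Plain pickle would fail here: `task.__wrapped__` no longer matches the
# module attribute `task`. cloudpickle serializes the code object itself,
# so the round trip succeeds.
restored = cloudpickle.loads(cloudpickle.dumps(task.__wrapped__))
print(restored())
```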
## 🚀 Feature Request

### Motivation

I want to use Hydra with `torch.distributed.launch` for multi-node, multi-GPU training. The problem is that the `torch.distributed.launch` module automatically passes a `--local_rank` argument to the script, leading to `unrecognized arguments: --local_rank`. In this case, how should I write my script to accept this appended argument? Currently, it seems command-line arguments of the form `--some_argument` are received directly by Hydra.