
[Feature Request] hydra with torch.distributed.launch #2038

Closed
18445864529 opened this issue Feb 17, 2022 · 14 comments · Fixed by #2119
Labels
question Hydra usage question

Comments

@18445864529
🚀 Feature Request

Motivation

I want to use Hydra with torch.distributed.launch for multi-node multi-GPU training. The problem is that the torch.distributed.launch module automatically passes a --local_rank argument to the script, which leads to unrecognized arguments: --local_rank.
In this case, how should I structure my script to accept this appended argument?
Currently, it seems command-line arguments in the format --some_argument are received directly by Hydra.

@18445864529 18445864529 added the enhancement Enhancement request label Feb 17, 2022
@Jasha10
Collaborator

Jasha10 commented Feb 17, 2022

The issue is that @hydra.main reads from sys.argv. Since torch.distributed.launch also injects its --local_rank argument into sys.argv, the two conflict.
The most flexible solution is to use the Compose API.
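As an illustration of that approach (the override string below is a made-up example, not part of anyone's config): parse the injected --local_rank yourself with argparse, and hand the remaining CLI words to hydra.compose as overrides instead of decorating the entry point with @hydra.main.

```python
import argparse

# Sketch: strip torch.distributed.launch's injected --local_rank from the
# command line, leaving the remaining words to be passed to
# hydra.compose(config_name=..., overrides=...) instead of letting
# @hydra.main read sys.argv directly.
def split_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args, hydra_overrides = parser.parse_known_args(argv)
    return args.local_rank, hydra_overrides

rank, overrides = split_args(["--local_rank", "3", "trainer.gpus=2"])
# rank == 3; overrides == ["trainer.gpus=2"]
```

Because parse_known_args only consumes the arguments it knows about, Hydra never sees --local_rank and torch.distributed never sees Hydra's overrides.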

@Jasha10 Jasha10 added question Hydra usage question and removed enhancement Enhancement request labels Feb 17, 2022
@jbaczek
Contributor

jbaczek commented Mar 7, 2022

The problem goes much deeper than a conflict in the argument space.
First of all, torch.distributed.launch is deprecated. You should use torch.distributed.run, which passes all arguments through environment variables. The documentation also describes torchrun, a wrapper around torch.distributed.run. Please refer to the documentation: https://pytorch.org/docs/stable/elastic/run.html
Second, torch.distributed works by spawning multiple processes, each running an instance of your script. This means it will launch Hydra multiple times, creating (often conflicting) output dirs.
The solution I am investigating is to spawn the subprocesses after calling Hydra. According to this doc: https://pytorch.org/docs/stable/elastic/customization.html it is possible to spawn the desired processes from a Python script and to call a function, not a whole script.
Thus I wanted to implement a custom launcher. I read the implementations of the joblib launcher and example_launcher in Hydra's repo, but it looks like currently the only way to use a launcher is via a sweeper, and only in a multirun setup. I consider my distributed run a single job, and I want the ability to launch my jobs however I want in a single-run setup.
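A rough sketch of that "spawn after Hydra" idea (the worker script path, master address, and port below are placeholders, not Hydra or torch API): once Hydra has composed the config in the parent process, launch the workers yourself with the environment variables that torch.distributed reads, so Hydra runs exactly once and creates a single output dir.

```python
import os
import subprocess
import sys

# Sketch: start world_size worker processes *after* Hydra has run in the
# parent. Each worker receives the env vars torch.distributed expects
# (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT); the worker
# script itself is a placeholder for your training entry point.
def launch_workers(world_size, script):
    procs = []
    for rank in range(world_size):
        env = dict(
            os.environ,
            RANK=str(rank),
            LOCAL_RANK=str(rank),
            WORLD_SIZE=str(world_size),
            MASTER_ADDR="127.0.0.1",
            MASTER_PORT="29500",
        )
        procs.append(subprocess.Popen([sys.executable, script], env=env))
    # wait for all workers and collect their exit codes
    return [p.wait() for p in procs]
```

This is only single-node; a real launcher would also handle multi-node rendezvous, which is what torch.distributed.elastic provides.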

@Jasha10
Collaborator

Jasha10 commented Mar 7, 2022

Thanks for the insight @jbaczek.

That is a good point about being unable to use a custom launcher without also using a custom sweeper and multirun mode. A feature request for supporting custom launchers in single-run mode would be most welcome.

@jbaczek
Contributor

jbaczek commented Mar 10, 2022

I hit another wall calling torch.distributed as a launcher. The launcher is called inside Hydra's main wrapper. I wanted to pass the original function to the multiprocessing context, which pickles the function together with its arguments and passes it to Popen. (By the way, OmegaConf.register_new_resolver is unpicklable because it creates a local function at runtime, and this sits in singleton state. EDIT: this is apparently a known issue; Hydra's own code notes "Plugins can cause issues for pickling the singleton state".) But here comes the issue: pickle serializes only metadata about the function and then restores it from the code. At that point, Hydra's main wrapper has already overwritten the main function, so the pickled function does not match the function it tries to restore. Here is a repro:

import functools
import pickle

def decorate(f):
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        print(pickle.dumps(f))
        return f(*args, **kwargs)
    return wrapper

@decorate
def main():
    print('A')

if __name__ == '__main__':
    main()
$ python tmp.py
Traceback (most recent call last):
...
_pickle.PicklingError: Can't pickle <function main at 0x7ff1df907ee0>: it's not the same object as __main__.main

I'm still looking into the issue. Any hints on how to solve this are welcome.
EDIT: Managed to solve it by using fork as the start method instead of spawn.
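To illustrate why fork avoids the error, here is a minimal sketch mirroring the repro above: with the fork start method the child inherits the parent's address space, so the decorated function is never pickled at all.

```python
import functools
import multiprocessing as mp

def decorate(f):
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        return f(*args, **kwargs)
    return wrapper

@decorate
def main(queue):
    queue.put("A")

def run_forked():
    # "fork" copies the parent's memory, so no pickling of `main` happens;
    # "spawn" would re-import the module and hit the
    # "not the same object as __main__.main" mismatch.
    ctx = mp.get_context("fork")
    queue = ctx.Queue()
    p = ctx.Process(target=main, args=(queue,))
    p.start()
    result = queue.get()
    p.join()
    return result
```

Note that fork is only available on POSIX systems and can be unsafe in multi-threaded parents, which is presumably why torch defaults to spawn.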

@longmao-yiran

You can pass --use_env to torch.distributed.launch to suppress the --local_rank argument; the local rank is then provided via the LOCAL_RANK environment variable instead.

@jbaczek
Contributor

jbaczek commented Mar 15, 2022

@Jasha10 I have implemented a working plugin for this. Would you be interested in a PR?

@Jasha10
Collaborator

Jasha10 commented Mar 17, 2022

@jbaczek I'd certainly be interested to take a look at the code you've written. I can't promise that we'd merge a PR (I'd have to discuss it with the Hydra team).

@zhaoedf

zhaoedf commented Jun 19, 2022

Add torchrun launcher plugin #2119

Hi, I was wondering if this plugin is available now, since I cannot find any docs on the Hydra website.

@jieru-hu
Contributor

Hi @zhaoedf,
we've added a torchrun launcher to the contrib folder here: https://github.com/facebookresearch/hydra/tree/main/contrib
Note that this is not an officially supported Hydra plugin, so we may not be able to provide the same level of support as for official Hydra plugins.

@zhaoedf

zhaoedf commented Jun 21, 2022

> hi @zhaoedf we've added torch run to the contrib folder here https://github.com/facebookresearch/hydra/tree/main/contrib note that this is not an officially supported Hydra plugin so we may not be able to provide same level support as the official Hydra plugin.

Since contrib is not available on PyPI, does that mean I need to download the code from the link you provided and manually install it into my conda env?

@Jasha10
Collaborator

Jasha10 commented Jun 21, 2022

@zhaoedf that is correct.
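For reference, one way to do that manual install (the subdirectory path below is an assumption; verify the exact name against the contrib folder linked above):

```shell
# Install the contrib plugin straight from GitHub; the subdirectory name
# is an assumption -- check the repo's contrib/ folder for the exact path.
pip install "git+https://github.com/facebookresearch/hydra.git#subdirectory=contrib/hydra_torchrun_launcher"

# Or clone the repo and install the plugin from the local checkout:
git clone https://github.com/facebookresearch/hydra.git
pip install ./hydra/contrib/hydra_torchrun_launcher
```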

@zhaoedf

zhaoedf commented Aug 2, 2022

For anyone who might be interested: you can use --use_env to pass the local rank via the environment without causing a conflict.

CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port $((GPU + 19020)) --nproc_per_node ${NB_GPU} --use_env $project_path/train_wo_ddp.py

local_rank = int(os.environ["LOCAL_RANK"])

@AndyJZhao

For anyone who is struggling: simply use torchrun, which enables the --use_env behavior by default:
CUDA_VISIBLE_DEVICES=0,1 path/to/conda/bin/torchrun --master_port={your_port} --nproc_per_node=2 project/path/run.py exp=your/hydra/expsetting
local_rank = int(os.environ["LOCAL_RANK"])

@acherstyx

I've created an updated version of the hydra-torchrun-launcher plugin at https://github.com/acherstyx/hydra-torchrun-launcher. I've resolved the PicklingError by utilizing cloudpickle.

Now, this version of the launcher should work exactly like launching with torchrun.
