
[Feature Request] hydra with torch.distributed.launch #2038

Closed
18445864529 opened this issue Feb 17, 2022 · 14 comments · Fixed by #2119
Labels
question Hydra usage question

Comments

@18445864529
🚀 Feature Request

Motivation

I want to use Hydra with torch.distributed.launch for multi-node multi-GPU training. The problem is that the torch.distributed.launch module automatically passes a --local_rank argument to the script, which leads to unrecognized arguments: --local_rank.
In this case, how should I structure my script to accept this appended argument?
Currently, it seems command-line arguments in the format --some_argument are received directly by Hydra.

@18445864529 18445864529 added the enhancement Enhancement request label Feb 17, 2022
@Jasha10
Collaborator

Jasha10 commented Feb 17, 2022

The issue is that @hydra.main reads from sys.argv. Since torch.distributed.launch also injects its --local_rank argument into sys.argv, the two conflict.
The most flexible solution is to use the Compose API.
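As an illustration of that approach (the override string below is a made-up example, not part of anyone's config): parse the injected --local_rank yourself with argparse, and hand the remaining CLI words to hydra.compose as overrides instead of decorating the entry point with @hydra.main.

```python
import argparse

# Sketch: strip torch.distributed.launch's injected --local_rank from the
# command line, leaving the remaining words to be passed to
# hydra.compose(config_name=..., overrides=...) instead of letting
# @hydra.main read sys.argv directly.
def split_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args, hydra_overrides = parser.parse_known_args(argv)
    return args.local_rank, hydra_overrides

rank, overrides = split_args(["--local_rank", "3", "trainer.gpus=2"])
# rank == 3; overrides == ["trainer.gpus=2"]
```

Because parse_known_args only consumes the arguments it knows about, Hydra never sees --local_rank and torch.distributed never sees Hydra's overrides.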

@Jasha10 Jasha10 added question Hydra usage question and removed enhancement Enhancement request labels Feb 17, 2022
@jbaczek
Contributor

jbaczek commented Mar 7, 2022

The problem goes much deeper than a conflict in the argument space.
First of all, torch.distributed.launch is deprecated. You should use torch.distributed.run, which passes all arguments through environment variables. The documentation also describes torchrun, a wrapper around torch.distributed.run. Please refer to the documentation: https://pytorch.org/docs/stable/elastic/run.html
Second, torch.distributed works by spawning multiple processes, each running an instance of your script. This means it will launch Hydra multiple times, creating (often conflicting) output dirs.
The solution I am investigating is to spawn the subprocesses after calling Hydra. According to this doc: https://pytorch.org/docs/stable/elastic/customization.html it is possible to spawn the desired processes from a Python script and to call a function, not a whole script.
Thus I wanted to implement a custom launcher. I read the implementations of the joblib launcher and example_launcher in Hydra's repo, but it looks like currently the only way to use a launcher is via a sweeper, and only in a multirun setup. I consider my distributed run a single job, and I want the ability to launch my jobs however I want in a single-run setup.
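A rough sketch of that "spawn after Hydra" idea (the worker script path, master address, and port below are placeholders, not Hydra or torch API): once Hydra has composed the config in the parent process, launch the workers yourself with the environment variables that torch.distributed reads, so Hydra runs exactly once and creates a single output dir.

```python
import os
import subprocess
import sys

# Sketch: start world_size worker processes *after* Hydra has run in the
# parent. Each worker receives the env vars torch.distributed expects
# (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT); the worker
# script itself is a placeholder for your training entry point.
def launch_workers(world_size, script):
    procs = []
    for rank in range(world_size):
        env = dict(
            os.environ,
            RANK=str(rank),
            LOCAL_RANK=str(rank),
            WORLD_SIZE=str(world_size),
            MASTER_ADDR="127.0.0.1",
            MASTER_PORT="29500",
        )
        procs.append(subprocess.Popen([sys.executable, script], env=env))
    # wait for all workers and collect their exit codes
    return [p.wait() for p in procs]
```

This is only single-node; a real launcher would also handle multi-node rendezvous, which is what torch.distributed.elastic provides.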

@Jasha10
Collaborator

Jasha10 commented Mar 7, 2022

Thanks for the insight @jbaczek.

That is a good point about being unable to use a custom launcher without also using a custom sweeper and multirun mode. A feature request for supporting custom launchers in single-run mode would be most welcome.

@jbaczek
Contributor

jbaczek commented Mar 10, 2022

I hit another wall calling torch.distributed as a launcher. The launcher is called inside Hydra's main wrapper. I wanted to pass the original function to the multiprocessing context, which pickles the function together with its arguments and passes it to Popen. (By the way, OmegaConf.register_new_resolver is unpicklable because it creates a local function at runtime, and this sits in singleton state. EDIT: this is apparently a known issue; Hydra's own code notes "Plugins can cause issues for pickling the singleton state".) But here comes the issue: pickle serializes only metadata about the function and then restores it from the code. At that point, Hydra's main wrapper has already overwritten the main function, so the pickled function does not match the function it tries to restore. Here is a repro:

import functools
import pickle

def decorate(f):
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        print(pickle.dumps(f))
        return f(*args, **kwargs)
    return wrapper

@decorate
def main():
    print('A')

if __name__ == '__main__':
    main()
$ python tmp.py
Traceback (most recent call last):
...
_pickle.PicklingError: Can't pickle <function main at 0x7ff1df907ee0>: it's not the same object as __main__.main

I'm still looking into the issue. Any hints on how to solve this are welcome.
EDIT: Managed to solve it by using fork as the start method instead of spawn.
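To illustrate why fork avoids the error, here is a minimal sketch mirroring the repro above: with the fork start method the child inherits the parent's address space, so the decorated function is never pickled at all.

```python
import functools
import multiprocessing as mp

def decorate(f):
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        return f(*args, **kwargs)
    return wrapper

@decorate
def main(queue):
    queue.put("A")

def run_forked():
    # "fork" copies the parent's memory, so no pickling of `main` happens;
    # "spawn" would re-import the module and hit the
    # "not the same object as __main__.main" mismatch.
    ctx = mp.get_context("fork")
    queue = ctx.Queue()
    p = ctx.Process(target=main, args=(queue,))
    p.start()
    result = queue.get()
    p.join()
    return result
```

Note that fork is only available on POSIX systems and can be unsafe in multi-threaded parents, which is presumably why torch defaults to spawn.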

@longmao-yiran

You can pass --use_env to torch.distributed.launch to suppress the --local_rank argument; the local rank is then provided via the LOCAL_RANK environment variable instead.

@jbaczek
Contributor

jbaczek commented Mar 15, 2022

@Jasha10 I have implemented a working plugin for this. Would you be interested in a PR?

@Jasha10
Collaborator

Jasha10 commented Mar 17, 2022

@jbaczek I'd certainly be interested to take a look at the code you've written. I can't promise that we'd merge a PR (I'd have to discuss it with the Hydra team).

@zhaoedf

zhaoedf commented Jun 19, 2022

Add torchrun launcher plugin #2119

Hi, I was wondering if this plugin is available now, since I cannot find any docs on the Hydra website.

@jieru-hu
Contributor

Hi @zhaoedf,
we've added a torchrun launcher to the contrib folder here: https://github.com/facebookresearch/hydra/tree/main/contrib
Note that this is not an officially supported Hydra plugin, so we may not be able to provide the same level of support as for official Hydra plugins.

@zhaoedf

zhaoedf commented Jun 21, 2022

> hi @zhaoedf we've added torch run to the contrib folder here https://github.com/facebookresearch/hydra/tree/main/contrib note that this is not an officially supported Hydra plugin so we may not be able to provide same level support as the official Hydra plugin.

Since contrib is not available on PyPI, does that mean I need to download the code from the link you provided and manually install it into my conda env?

@Jasha10
Collaborator

Jasha10 commented Jun 21, 2022

@zhaoedf that is correct.
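For reference, one way to do that manual install (the subdirectory path below is an assumption; verify the exact name against the contrib folder linked above):

```shell
# Install the contrib plugin straight from GitHub; the subdirectory name
# is an assumption -- check the repo's contrib/ folder for the exact path.
pip install "git+https://github.com/facebookresearch/hydra.git#subdirectory=contrib/hydra_torchrun_launcher"

# Or clone the repo and install the plugin from the local checkout:
git clone https://github.com/facebookresearch/hydra.git
pip install ./hydra/contrib/hydra_torchrun_launcher
```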

@zhaoedf

zhaoedf commented Aug 2, 2022

For anyone who might be interested: you can use --use_env to pass the local rank via the environment without causing a conflict.

CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port $((GPU + 19020)) --nproc_per_node ${NB_GPU} --use_env $project_path/train_wo_ddp.py

local_rank = int(os.environ["LOCAL_RANK"])

@AndyJZhao

For anyone who is struggling: simply use torchrun, which enables the --use_env behavior by default:
CUDA_VISIBLE_DEVICES=0,1 path/to/conda/bin/torchrun --master_port={your_port} --nproc_per_node=2 project/path/run.py exp=your/hydra/expsetting
local_rank = int(os.environ["LOCAL_RANK"])

@acherstyx

I've created an updated version of the hydra-torchrun-launcher plugin at https://github.com/acherstyx/hydra-torchrun-launcher. I've resolved the PicklingError by utilizing cloudpickle.

Now, this version of the launcher should work exactly like launching with torchrun.
