[Feature Request] Support non blocking launching #692

jan-matthis · 2020-06-19T18:43:55Z

🚀 Feature Request

Some launchers enqueue jobs rather than executing them directly. It'd be great if Hydra supported that.

The interim solution in the RQ launcher (#683) is to throw an exception after enqueuing, which doesn't make for a great user experience.

RemyLau · 2022-04-23

Usual setup (without Hydra)

Here's an example of my usual main script (say src/main.py), which consists of four major components:

1. Load configurations based on the selected option
2. Load data
3. Run some heavy computation on the data with the configurations, which usually takes hours to run
4. Save the results once the computations are done

import argparse

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--cfg", type=int, help="Configuration id")
    args = parser.parse_args()
    # Configuration loading is simplified here; in reality it's much messier.
    cfg = load_cfg(args.cfg)

    data = load_data(cfg.data_dir)
    results = heavy_computations(data, cfg.my_params)
    save_results(cfg.result_dir, results)

if __name__ == "__main__":
    main()

Suppose I have 10 different configurations and would like to run all of them via SLURM job submission. I would then create a SLURM batch job file (say slurm_job.sb) that looks something like the following.

#!/bin/bash
#SBATCH -N 1
#SBATCH -c 4
#SBATCH -t 4:00:00
#SBATCH --mem=8GB
#SBATCH --array=0-9   # 10 different settings I wish to run

cd $SLURM_SUBMIT_DIR
python src/main.py --cfg $SLURM_ARRAY_TASK_ID

which I will then run via sbatch slurm_job.sb.

Using Hydra

Now, because Hydra is so awesome :) and provides a really nice integration with SLURM via submitit, I can reduce all of the above to just the following.

import hydra
from omegaconf import DictConfig

@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    data = load_data(cfg.data_dir)
    results = heavy_computations(data, cfg.my_params)
    save_results(cfg.result_dir, results)

if __name__ == "__main__":
    main()

Now, to submit the jobs using SLURM, all I need to do is:

python src/main.py -m hydra/launcher=submitit_slurm ... # plus other overrides and sweep params

Issue

Everything is pretty satisfying up to this point. The only problem is that Hydra waits for the SLURM jobs to finish before exiting. I can see why this is desirable in many cases, such as hyperparameter tuning, but in my case each job saves its own results, so I don't need a "master worker" to wait for all the results to come back. All I want is to submit the jobs and let them run, without waiting until they are all finished (which could take days if I'm submitting hundreds or thousands of them).

Workaround

A hacky solution that I've been using is simply to press Ctrl-D once I see that all the jobs have been submitted. Another workaround is to use something like tmux (see the comment in #1847). But ultimately, it would be great to have native support for this kind of non-blocking job submission.

Proposed solution

The return line above is where the blocking happens (specifically the call to j.results()). The idea is to NOT wait for the results to come back and simply return None instead, which would be the return value anyway.

After looking a little deeper into the source code, I want to make a small correction to my original proposal: instead of returning None, I should return a list of trivial JobReturn objects whose status is set to JobStatus.COMPLETED. This way, the master Hydra thread can exit correctly. I've also created a PR that implements what I proposed here (#2171).
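A minimal sketch of the idea, not the actual change in #2171: the JobStatus and JobReturn classes below are simplified stand-ins for the ones in hydra.core.utils, and FakeJob and collect_results are hypothetical names used only for illustration. The point is just that the non-blocking branch never calls results() and hands back placeholder objects marked as completed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, List, Sequence


class JobStatus(Enum):
    # Simplified stand-in for hydra.core.utils.JobStatus.
    UNKNOWN = 0
    COMPLETED = 1
    FAILED = 2


@dataclass
class JobReturn:
    # Simplified stand-in for hydra.core.utils.JobReturn.
    status: JobStatus = JobStatus.UNKNOWN
    return_value: Any = None


class FakeJob:
    """Hypothetical stand-in for a submitted job whose results() call blocks."""

    def __init__(self, value: Any) -> None:
        self.value = value
        self.waited = False

    def results(self) -> List[JobReturn]:
        # A real scheduler job would block here until the job finishes.
        self.waited = True
        return [JobReturn(status=JobStatus.COMPLETED, return_value=self.value)]


def collect_results(jobs: Sequence[FakeJob], blocking: bool = True) -> List[JobReturn]:
    if blocking:
        # Current behaviour: wait for every job before returning.
        return [j.results()[0] for j in jobs]
    # Proposed behaviour: never call results(); report placeholder successes.
    return [JobReturn(status=JobStatus.COMPLETED) for _ in jobs]
```

Note that in the non-blocking branch the COMPLETED status only means "submitted successfully"; whether each job actually succeeded has to be checked through the scheduler or the saved results.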

Jasha10 · 2022-04-25T21:12:32Z

Awesome, thanks for the explanation @RemyLau!

jieru-hu · 2022-04-25T21:39:54Z

thanks for the context and the PR @RemyLau

To support non-blocking launch we will need to take a look across all the launchers and make sure we come up with something that makes sense across the board. As you can see, this is planned for Hydra 1.3 for now.

For now, I'd recommend looking into a solution like tmux. Such tools should allow you to run the script in the background and re-attach to the session if needed.

timruhkopf · 2024-02-15T10:06:46Z

Hi,
I am super curious about the state of this feature. I saw that it was planned in the milestones for Hydra 1.3, but I couldn't find anything in the docs yet (pointers would be appreciated), and we are at hydra-core==1.3.2 by now.

I would also like to point out that, while requiring a (slim) main worker is slightly painful, particularly if the subprocesses take a long time, there is a far more compelling use case: asynchronous hyperparameter optimizers like ASHA.
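To illustrate why asynchronous optimizers care about this, here is a small sketch of consuming results as they arrive instead of waiting for a whole batch. It is not ASHA and does not use Hydra: the thread pool stands in for a non-blocking launcher, and run_trial and async_search are hypothetical names with a toy loss function.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List, Tuple


def run_trial(lr: float) -> float:
    """Hypothetical trial: pretend to train and return a loss."""
    time.sleep(0.01)  # stands in for hours of computation
    return (lr - 0.1) ** 2


def async_search(learning_rates: List[float]) -> Tuple[float, float]:
    best_lr, best_loss = learning_rates[0], float("inf")
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Submitting returns immediately with one future per trial.
        futures = {pool.submit(run_trial, lr): lr for lr in learning_rates}
        # React to each result as soon as it arrives, in completion order.
        for fut in as_completed(futures):
            loss = fut.result()
            if loss < best_loss:
                best_lr, best_loss = futures[fut], loss
    return best_lr, best_loss
```

An asynchronous optimizer would use the body of that loop to decide which trial to promote or launch next, which is not possible if the launcher only returns once every job has finished.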

Anyways, I would appreciate an early access if there is something already cooked up for it.

odelalleau · 2024-02-15T14:21:55Z

> Anyways, I would appreciate an early access if there is something already cooked up for it.

I think it's unlikely to be implemented anytime soon, since as @jieru-hu mentioned it would require some non-trivial amount of work to make it consistent across the whole codebase, and there exists a simple workaround (nohup python my_app.py .... & -- where nohup may not necessarily be required depending on your system) that should do the trick. If for some reason this doesn't work for your use case, please describe why, there may be other solutions.
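For anyone who prefers to script this workaround rather than type it in a shell, a rough Python counterpart might look like the sketch below. It is an assumption-laden illustration: launch_detached is a hypothetical helper, and start_new_session=True (POSIX only) is used as an approximation of what nohup ... & achieves by detaching the child from the terminal's session.

```python
import subprocess
import sys
from typing import Sequence


def launch_detached(cmd: Sequence[str], log_path: str) -> subprocess.Popen:
    """Start cmd in its own session, append its output to log_path, and return at once."""
    with open(log_path, "ab") as log:
        proc = subprocess.Popen(
            list(cmd),
            stdin=subprocess.DEVNULL,   # do not keep the terminal's stdin open
            stdout=log,                 # the child keeps its own copy of the file handle
            stderr=subprocess.STDOUT,
            start_new_session=True,     # detach from the controlling terminal (POSIX)
        )
    return proc


if __name__ == "__main__":
    # Harmless demo command; replace it with the real Hydra multirun invocation.
    p = launch_detached([sys.executable, "-c", "print('submitted')"], "launch.log")
    print(f"started background process with PID {p.pid}")
```

The Hydra process still waits for the jobs in the background; this only frees the terminal, which is all the workaround is meant to do.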

jan-matthis added the enhancement Enhanvement request label Jun 19, 2020

jan-matthis mentioned this issue Jun 19, 2020

RQ launcher #683

Merged

omry added this to the 1.0.0 milestone Jun 19, 2020

omry added the plugin Plugins realted issues label Jun 29, 2020

omry modified the milestones: 1.0.0, 1.1.0 Jul 14, 2020

omry changed the title ~~[Feature Request] Support asynchronous launching~~ [Feature Request] Support non blocking launching Jul 14, 2020

jieru-hu self-assigned this Nov 19, 2020

omry modified the milestones: Hydra 1.1.0, Hydra 1.2.0 Feb 18, 2021

jieru-hu mentioned this issue Oct 8, 2021

[Feature Request] [Submitit-Plugin] Add option to not wait for finished jobs #1847

Closed

jieru-hu removed their assignment Oct 24, 2021

Jasha10 modified the milestones: Hydra 1.2.0, Hydra 1.3.0 Nov 12, 2021

ncvv mentioned this issue Apr 14, 2022

Pruning option for Optuna #1954

Open

RemyLau mentioned this issue Apr 23, 2022

Add non-blocking slurm job submission option #2171

Open
