[Feature Request] Support non blocking launching #692

Open
jan-matthis opened this issue Jun 19, 2020 · 10 comments
Labels
enhancement (Enhancement request), plugin (Plugins related issues)

@jan-matthis
Contributor

🚀 Feature Request

Some launchers enqueue jobs rather than executing them directly. It'd be great if Hydra supported that.

The interim solution in the RQ launcher (#683) is to throw an exception after enqueuing, which doesn't make for a great user experience.

See also: #683 (comment)

@jan-matthis jan-matthis added the enhancement Enhancement request label Jun 19, 2020
@omry omry added this to the 1.0.0 milestone Jun 19, 2020
@omry
Collaborator

omry commented Jun 19, 2020

Tentative for 1.0.

@omry omry added the plugin Plugins related issues label Jun 29, 2020
@omry omry modified the milestones: 1.0.0, 1.1.0 Jul 14, 2020
@omry omry changed the title from "[Feature Request] Support asynchronous launching" to "[Feature Request] Support non blocking launching" Jul 14, 2020
@jieru-hu jieru-hu self-assigned this Nov 19, 2020
@omry omry modified the milestones: Hydra 1.1.0, Hydra 1.2.0 Feb 18, 2021
@omry
Collaborator

omry commented Feb 18, 2021

Launching enhancements will come in 1.2 or 1.3.

@robogast

Also see #1710 for a (relatively) easy fix for this issue, if making the launcher natively async remains out-of-scope for the time being.

@jieru-hu jieru-hu removed their assignment Oct 24, 2021
@Jasha10 Jasha10 modified the milestones: Hydra 1.2.0, Hydra 1.3.0 Nov 12, 2021
@RemyLau

RemyLau commented Apr 21, 2022

I think non-blocking launching, especially for submitit, could be very helpful. Seems like a naive solution for this would be simply bypassing the following line when the non-blocking option (say submitit_slurm_nonblocking) is set

and replace it with return None ([Update]: probably need to return a list of trivial JobReturns, see more below).

Apparently, this assumes that the target function does not return anything (usually directly saving results to some files for example), which is typically how I'm setting things up.

Would this be a reasonable patch to be implemented for the non-blocking feature?

@Jasha10
Collaborator

Jasha10 commented Apr 22, 2022

@RemyLau what use-case would be enabled by removing the line you mentioned above?

@RemyLau

RemyLau commented Apr 23, 2022

Hi @Jasha10, the way I've been using the submitit_slurm plugin is to mimic how I would use SLURM to submit a bunch of jobs that run a script with different settings via job arrays. In short, each job saves its own result, so I don't need a master worker to aggregate all the results. Hence, there's no need to block and wait for all jobs to be finished.

Usual setup (without Hydra)

Here's an example of my usual main script (say src/main.py), which consists of four major components:

  1. Load configurations based on the selected option
  2. Load data
  3. Run some heavy computation on the data with the configurations, which usually takes hours to run
  4. Save the results once the computations are done

import argparse


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--cfg", type=int, help="Configuration id")
    args = parser.parse_args()
    # Here I'm simplifying the configuration loading process.. in reality, it's much messier....
    cfg = load_cfg(args.cfg)

    data = load_data(cfg.data_dir)
    results = heavy_computations(data, cfg.my_params)
    save_results(cfg.result_dir, results)


if __name__ == "__main__":
    main()

Suppose I have 10 different configurations and would like to run all of them using SLURM job submission. I would then create a SLURM batch job file (say slurm_job.sb) that looks something like the following.

#!/bin/bash
#SBATCH -N 1
#SBATCH -c 4
#SBATCH -t 4:00:00
#SBATCH --mem=8GB
#SBATCH --array=0-9   # 10 different settings I wish to run

cd "$SLURM_SUBMIT_DIR"
python src/main.py --cfg "$SLURM_ARRAY_TASK_ID"

which I will then run via sbatch slurm_job.sb.

Using Hydra

Now, because Hydra is so awesome :) and provides a really nice integration with SLURM via submitit, I can reduce all the above to just the following.

import hydra
from omegaconf import DictConfig


@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    data = load_data(cfg.data_dir)
    results = heavy_computations(data, cfg.my_params)
    save_results(cfg.result_dir, results)


if __name__ == "__main__":
    main()

Now, to submit the jobs using SLURM, all I need to do is run

python src/main.py -m hydra/launcher=submitit_slurm ... # plus other overrides and sweep params

Issue

Everything is pretty satisfying up to this point. The only problem is that Hydra waits for the SLURM jobs to be finished before exiting. I can see why this is desirable in many cases such as when doing hyperparameter tuning, but since in my case, each job will save its own results, I don't need to have a "master worker" to wait for all results to come back. So all I want is to submit the jobs and let them run, and I do not wish to wait until all the jobs are finished (which could take days if I'm submitting hundreds or thousands of them).
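
To make the distinction concrete, here is a minimal sketch that uses only Python's standard library, not Hydra or submitit. The function `fake_job` is a hypothetical stand-in for a long-running SLURM job. Submitting returns almost immediately; the wait happens only when results are collected, which is analogous to the `j.results()` call discussed in this thread.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List


def fake_job(job_id: int) -> int:
    """Hypothetical stand-in for a long-running SLURM job."""
    time.sleep(0.2)
    return job_id


def submit_all(executor: ThreadPoolExecutor, n_jobs: int) -> List[Future]:
    """Submitting is cheap: each call hands back a Future right away."""
    return [executor.submit(fake_job, i) for i in range(n_jobs)]


with ThreadPoolExecutor(max_workers=4) as executor:
    start = time.monotonic()
    futures = submit_all(executor, 4)
    submit_seconds = time.monotonic() - start

    # This is the blocking step: result() waits for each job to finish.
    results = [f.result() for f in futures]
    total_seconds = time.monotonic() - start

print(f"submitted in {submit_seconds:.3f}s, results after {total_seconds:.3f}s")
```

A non-blocking launcher would stop after the submission step and never reach the result-collection step.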

Workaround

A hacky solution that I've been using is to simply press Ctrl-D once I see that all the jobs are submitted. Another workaround is to use something like tmux (#1847 (comment)). Ultimately, though, it would be great to have native support for this kind of non-blocking job submission.

Proposed solution

The return line above is where the blocking happens (particularly j.results()). The idea is therefore to NOT wait for results to come back and simply return None, which would be the return value anyway.

After looking a little deeper into the source code, I want to make a small correction to my originally proposed solution: instead of returning None, I should return a list of trivial JobReturn objects whose status is set to JobStatus.COMPLETED. This way, the master Hydra thread can exit correctly. I've also created a PR that implements what I proposed here (#2171).
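
The idea can be sketched as follows. This is not the code from the PR: `JobStatus` and `JobReturn` below are simplified stand-ins that only mirror the names of the classes in `hydra.core.utils` (the fields shown are an assumed subset), and `collect` and `wait_for_result` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, List, Optional, Sequence


class JobStatus(Enum):
    # Stand-in for hydra.core.utils.JobStatus (assumed members).
    UNKNOWN = 0
    COMPLETED = 1
    FAILED = 2


@dataclass
class JobReturn:
    # Stand-in for hydra.core.utils.JobReturn (assumed subset of fields).
    overrides: Optional[Sequence[str]] = None
    return_value: Any = None
    status: JobStatus = JobStatus.UNKNOWN


def collect(
    job_overrides: Sequence[Sequence[str]],
    blocking: bool,
    wait_for_result: Callable[[Sequence[str]], Any],
) -> List[JobReturn]:
    """Return one JobReturn per submitted job.

    In blocking mode each result is awaited. In non-blocking mode the
    results are never awaited and a trivial placeholder is returned.
    """
    returns = []
    for overrides in job_overrides:
        value = wait_for_result(overrides) if blocking else None
        returns.append(
            JobReturn(overrides=overrides, return_value=value, status=JobStatus.COMPLETED)
        )
    return returns


def never_called(overrides: Sequence[str]) -> Any:
    raise AssertionError("non-blocking mode must not wait for results")


sweep = [["cfg=0"], ["cfg=1"]]
returns = collect(sweep, blocking=False, wait_for_result=never_called)
print([r.status.name for r in returns])
```

One consequence of this design is that the COMPLETED status is only a placeholder: it says the job was handed off, not that it finished successfully, so anything that reads the status afterwards cannot rely on it.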

@Jasha10
Collaborator

Jasha10 commented Apr 25, 2022

Awesome, thanks for the explanation @RemyLau!

@jieru-hu
Contributor

Thanks for the context and the PR, @RemyLau.

To support non-blocking launch we will need to take a look across all the launchers and make sure we come up with something that makes sense across the board. As you can see, this is planned for Hydra 1.3 for now.

For now, I'd recommend looking into a solution like tmux. Such tools should allow you to run the script in the background and re-attach to the session if needed.

@timruhkopf

timruhkopf commented Feb 15, 2024

Hi,
I am super curious about the state of this feature. I saw that it was discussed in the milestones for Hydra 1.3, but I didn't find anything in the docs yet (pointers would be appreciated), and we are at hydra-core==1.3.2 by now.

I would also like to point out that, while requiring a (slim) main worker is slightly painful, particularly if the subprocesses take longer, there is a far more compelling use case: asynchronous hyperparameter optimizers like ASHA.

Anyways, I would appreciate an early access if there is something already cooked up for it.

@odelalleau
Collaborator

> Anyways, I would appreciate an early access if there is something already cooked up for it.

I think it's unlikely to be implemented anytime soon. As @jieru-hu mentioned, it would require a non-trivial amount of work to make it consistent across the whole codebase, and there is a simple workaround that should do the trick: nohup python my_app.py .... & (nohup may not be required, depending on your system). If for some reason this doesn't work for your use case, please describe why; there may be other solutions.
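
For anyone who prefers to start the sweep from a Python script rather than a shell, a rough equivalent of the nohup workaround might look like the sketch below. It uses only the standard library; `launch_detached` and the log path are hypothetical names, and `start_new_session=True` has an effect on POSIX systems only.

```python
import subprocess
import sys
from typing import Sequence


def launch_detached(command: Sequence[str], log_path: str) -> subprocess.Popen:
    """Start `command` in its own session and return without waiting for it."""
    log_file = open(log_path, "ab")
    try:
        process = subprocess.Popen(
            command,
            stdin=subprocess.DEVNULL,
            stdout=log_file,           # keep the launcher's output for later
            stderr=subprocess.STDOUT,
            start_new_session=True,    # POSIX: detach from the terminal's session
        )
    finally:
        # The child holds its own copy of the file descriptor.
        log_file.close()
    return process
```

It could then be called with the same arguments one would pass on the command line, for example `launch_detached([sys.executable, "my_app.py", "-m", "hydra/launcher=submitit_slurm"], "sweep.log")`, after which the calling script is free to exit while the Hydra process keeps waiting on the jobs in the background.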
