[Feature Request] Support non blocking launching #692
Tentative for 1.0.

Launching enhancements will come in 1.2 or 1.3.
Also see #1710 for a (relatively) easy fix for this issue, if making the launcher natively async remains out-of-scope for the time being.
I think non-blocking launching could be supported, especially for the submitit launcher, by removing this line:

hydra/plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/submitit_launcher.py Line 146 in 61ab29a

and replacing it with a non-blocking alternative. Apparently, this assumes that the target function does not return anything (usually it saves results directly to some files, for example), which is typically how I set things up. Would this be a reasonable patch to implement for the non-blocking feature?
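For illustration, the blocking vs. non-blocking distinction discussed above can be sketched with the stdlib `concurrent.futures` module (a toy analogy, not the submitit API; `slow_job` is a hypothetical stand-in for a launched task):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_job(i: int) -> int:
    # Hypothetical stand-in for a launched task.
    time.sleep(0.1)
    return i * i

executor = ThreadPoolExecutor(max_workers=4)
futures = [executor.submit(slow_job, i) for i in range(4)]

# Blocking pattern (analogous to the launcher collecting job results):
# wait for every job to finish before returning.
results = [f.result() for f in futures]
print(results)  # [0, 1, 4, 9]

# Non-blocking pattern: hand back the future/job handles immediately and
# let the caller collect results later (or never).
handles = futures
```

The patch proposed above amounts to switching from the first pattern to the second for jobs that persist their own output.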
@RemyLau what use case would be enabled by removing the line you mentioned above?
Hi @Jasha10, here's the way I've been using it.

**Usual setup (without Hydra)**

Here's an example of my usual main script (`src/main.py`):

```python
import argparse

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--cfg", type=int, help="Configuration id")
    args = parser.parse_args()
    # Here I'm simplifying the configuration loading process; in reality, it's much messier.
    cfg = load_cfg(args.cfg)
    data = load_data(cfg.data_dir)
    results = heavy_computations(data, cfg.my_params)
    save_results(cfg.result_dir, results)

if __name__ == "__main__":
    main()
```

Suppose I have 10 different configurations and I would like to run all of them using SLURM job submission. I would then create a SLURM batch job file:

```bash
#SBATCH -N 1
#SBATCH -c 4
#SBATCH -t 4:00:00
#SBATCH --mem=8GB
#SBATCH --array=0-9  # 10 different settings I wish to run

cd $SLURM_SUBMIT_DIR
python src/main.py --cfg $SLURM_ARRAY_TASK_ID
```

which I will then run via `sbatch`.

**Using Hydra**

Now, because Hydra is so awesome :) and provides a really nice integration with SLURM via the submitit launcher plugin, the script becomes:

```python
import hydra
from omegaconf import DictConfig

@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    data = load_data(cfg.data_dir)
    results = heavy_computations(data, cfg.my_params)
    save_results(cfg.result_dir, results)

if __name__ == "__main__":
    main()
```

To submit the jobs using SLURM, all I need to do is:

```bash
python src/main.py -m hydra/launcher=submitit_slurm ...  # plus other overrides and sweep params
```

**Issue**

Everything is pretty satisfying up to this point. The only problem is that Hydra waits for the SLURM jobs to be finished before exiting. I can see why this is desirable in many cases, such as when doing hyperparameter tuning, but since in my case each job will save its own results, I don't need a "master worker" to wait for all results to come back. So all I want is to submit the jobs and let them run; I do not wish to wait until all the jobs are finished (which could take days if I'm submitting hundreds or thousands of them).

**Workaround**

A hacky solution that I've been using is simply pressing Ctrl-D after I see that all the jobs are submitted. Another workaround is to run the submission inside something like tmux.

**Proposed solution**

The return line above is where the blocking happens. But after looking a little deeper into the source code, I wanted to make a small correction to my original proposed solution: instead of returning...
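To make the "each job saves its own results" setup concrete, here is a minimal self-contained sketch (all names are hypothetical; a thread pool stands in for the SLURM cluster, and note that leaving the `with` block still joins the workers, which a true fire-and-forget launch would not do):

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical result directory; in the real setup this would be cfg.result_dir.
result_dir = Path(tempfile.mkdtemp())

def heavy_computations(cfg_id: int) -> None:
    # Hypothetical job body: it computes and persists its own results,
    # returning nothing, so no launcher needs to collect a return value.
    result = {"cfg": cfg_id, "score": cfg_id * 2}
    (result_dir / f"result_{cfg_id}.json").write_text(json.dumps(result))

# Submit all 10 "jobs"; the pool stands in for SLURM here.
with ThreadPoolExecutor() as pool:
    for cfg_id in range(10):
        pool.submit(heavy_computations, cfg_id)

# Each job wrote its own file, so there is nothing to gather afterwards.
print(len(list(result_dir.glob("result_*.json"))))  # 10
```

Because every job persists its own output, the submitting process only needs to enqueue work; nothing downstream depends on its return values.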
Awesome, thanks for the explanation @RemyLau!
Thanks for the context and the PR @RemyLau. To support non-blocking launch we will need to take a look across all the launchers and make sure we come up with something that makes sense across the board. As you can see, this is planned for Hydra 1.3 for now. In the meantime, I'd recommend looking into a solution like tmux. Such tools allow you to run the script in the background and re-attach the session if needed.
Hi, I would also like to point out that while requiring a (slim) main worker is slightly painful, particularly if the subprocesses take longer, there is a far more compelling use case: asynchronous hyperparameter optimizers like ASHA. Anyway, I would appreciate early access if there is something already cooked up for it.
I think it's unlikely to be implemented anytime soon, since as @jieru-hu mentioned it would require a non-trivial amount of work to make it consistent across the whole codebase, and there exists a simple workaround (detaching the session, as described above).
🚀 Feature Request
Some launchers enqueue jobs rather than executing them directly. It'd be great if Hydra supported that.
The interim solution in the RQ launcher (#683) is to throw an exception after enqueuing, which doesn't make for a great user experience.
See also: #683 (comment)