
[Feature Request] [Submitit-Plugin] (Potentially a bug) Impossible to set certain flags in submitit launcher #1366

Closed
Queuecumber opened this issue Feb 1, 2021 · 35 comments · Fixed by #1375
Labels
enhancement Enhancement request

Comments

@Queuecumber
Contributor

🚀 Feature Request

Motivation

This could be read as a feature request or a bug report; I'm not sure how you want to consider it, so I'm going with feature request. Currently there are some flags that you cannot set in the submitit launcher. I practice "gpu centric" scheduling, so I like to specify mem_per_gpu and cpus_per_gpu; then I can just use gpus_per_task to always get the optimal settings.

For example, on "mystery cluster" (you know the one), we can use 10 CPUs and 64GB RAM per GPU. These settings mean I only have to change gpus_per_task: for example, if I need 2 GPUs for 2 different models, I'll automatically get 20 CPUs and 128GB RAM for each of the tasks without having to change all of the other settings. I've already PRed related support into submitit.

The problem occurs when you try to set, via additional_parameters, something like mem_per_gpu. You can't set both mem and mem_per_gpu; Slurm errors out when you do that. Similarly, if you try to set cpus_per_gpu via additional_parameters, you'll wind up setting it in addition to cpus_per_task.
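
For illustration, here is a minimal sketch (not from the original report) of how the same conflict can be reproduced against submitit directly; it assumes AutoExecutor's generic mem_gb parameter and the slurm_additional_parameters passthrough:

import submitit

# Hypothetical repro sketch: mem_gb is AutoExecutor's generic memory knob,
# while the slurm_-prefixed dict is forwarded to the Slurm executor as-is.
executor = submitit.AutoExecutor(folder="submitit_logs")
executor.update_parameters(
    mem_gb=64,  # ends up as an #SBATCH --mem line
    slurm_additional_parameters={"mem_per_gpu": "64GB"},  # ends up as #SBATCH --mem-per-gpu
)
# Submitting now would generate an sbatch file containing both --mem and
# --mem-per-gpu, which Slurm rejects as mutually exclusive.
# job = executor.submit(print, "hello")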

Pitch

I tried making a simple patch that fixes this, but it hits issues with the AutoExecutor, which I never updated in submitit to be aware of the options that conflict with each other. I think in general auto is missing some of the recent work in submitit, and working around that here feels semi-hacky to me.

Submitit already has pretty good validation logic, so my pitch is to (1) let submitit handle everything by calling the correct executor for the job instead of using Auto. This will require a revamp of how the parameters are named/passed, however, and will likely be a breaking change to the API. (2) It would be nice if we could allow people to pass whatever parameters are supported by submitit without needing to update the hydra schema each time. These are my two major goals; do they sound reasonable/feasible? Point (2) may not be possible, I guess.
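
To make point (1) concrete, here is a rough sketch (mine, not part of the pitch) of what driving the Slurm executor directly could look like; it assumes submitit.SlurmExecutor's additional_parameters dict maps straight to #SBATCH flags, and exact parameter names may differ between submitit versions:

import submitit


def train() -> None:
    # Placeholder task standing in for the real job.
    print("running on the allocated node")


# Sketch only: SlurmExecutor takes slurm-native parameters instead of the
# generic names AutoExecutor exposes (e.g. mem_gb, timeout_min).
executor = submitit.SlurmExecutor(folder="submitit_logs")
executor.update_parameters(
    additional_parameters={
        "mem_per_gpu": "64GB",
        "cpus_per_gpu": 10,
        "gpus_per_task": 1,
    },
)
job = executor.submit(train)
print(job.job_id)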

Other options:

  • Use auto but try to include our own validation logic (needs exploration). I don't like this because it requires us to duplicate logic that submitit already has, and it may not even be possible depending on what Auto decides to do (it looked iffy on my quick pass).
  • Punt this to submitit to revamp the auto executor to do the right thing and be more flexible. I don't like this because I think the auto executor is more of a convenience hack, and more complex/flexible code would be calling the executors directly (this is what I was doing previously with my own submitit hydra wrapper). I also think it's going to be a much more complex PR.

Are you willing to open a pull request? (See CONTRIBUTING)

Yes, but I want feedback first on the best way to go about it

Additional context

Add any other context or screenshots about the feature request here.

Queuecumber added the enhancement label on Feb 1, 2021
@jieru-hu
Contributor

jieru-hu commented Feb 2, 2021

Thanks for reporting @Queuecumber !

This sounds like an issue with the Hydra default config not having the params as Optional.

Could you share a minimal repro of your configuration?

@Queuecumber
Contributor Author

Queuecumber commented Feb 2, 2021 via email

It's more than that unfortunately, I tried making them optional and that's where I ran into the described problems with the auto executor.

@jieru-hu
Contributor

jieru-hu commented Feb 2, 2021

Got it, thanks. It would still be helpful if we can get a minimal repro here when you get a chance.

@Queuecumber
Contributor Author

Queuecumber commented Feb 2, 2021 via email

@omry
Collaborator

omry commented Feb 2, 2021

To me it sounds more like a design issue with the plugin/Submitit's AutoExecutor.

Once we have a concrete example of what you are trying to do, it will be easier to discuss and involve people from the submitit side if we need to.

@Queuecumber
Contributor Author

Here's an example config that should reproduce:

defaults:
  - hydra/launcher: submitit_slurm

hydra:
  launcher:
    mem_gb: null
    additional_parameters:
      mem_per_gpu: 64GB

with a simple python file:

import hydra
from omegaconf import DictConfig


@hydra.main(config_name="slurm_extra_params", config_path="configs")
def main(cfg: DictConfig):
    return


if __name__ == "__main__":
    main()

As of current hydra, this will throw an error that a required parameter was set to None (or something like that). If you modify the launcher config (config.py in the submitit plugin) so that the field reads mem_gb: Optional[int] = None, then AutoExecutor complains with a similar error:

Traceback (most recent call last):
  File "/private/home/mehrlich/.conda/envs/qgac/lib/python3.7/site-packages/hydra/_internal/utils.py", line 212, in run_and_report
    return func()
  File "/private/home/mehrlich/.conda/envs/qgac/lib/python3.7/site-packages/hydra/_internal/utils.py", line 376, in <lambda>
    overrides=args.overrides,
  File "/private/home/mehrlich/.conda/envs/qgac/lib/python3.7/site-packages/hydra/_internal/hydra.py", line 139, in multirun
    return sweeper.sweep(arguments=task_overrides)
  File "/private/home/mehrlich/.conda/envs/qgac/lib/python3.7/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 154, in sweep
    results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
  File "/private/home/mehrlich/.conda/envs/qgac/lib/python3.7/site-packages/hydra_plugins/hydra_submitit_launcher/submitit_launcher.py", line 127, in launch
    executor.update_parameters(**params)
  File "/private/home/mehrlich/.conda/envs/qgac/lib/python3.7/site-packages/submitit/core/core.py", line 669, in update_parameters
    self._internal_update_parameters(**kwargs)
  File "/private/home/mehrlich/.conda/envs/qgac/lib/python3.7/site-packages/submitit/auto/auto.py", line 147, in _internal_update_parameters
    f'Parameter "{name}" expected type {expected_type} ' f'(but value: "{kwargs[name]}")'
AssertionError: Parameter "mem_gb" expected type <class 'int'> (but value: "None")

because it is hardcoded to want mem_gb, which is a little weird because, as far as I know, memory settings don't do anything for the local executor.

My view on it is that the AutoExecutor is fine for simple things, but anything requiring fine-grained control over the executor should call the executors directly.

@omry
Collaborator

omry commented Feb 2, 2021

It looks like the plugin configuration is faithfully reflecting the intention of the library authors. You should probably open this discussion in the submitit repo.

@Queuecumber
Contributor Author

I disagree, because the slurm executor allows these to all be optional; the auto executor just wasn't updated to support that, and it isn't clear how best to do it.

@Queuecumber
Contributor Author

Queuecumber commented Feb 2, 2021

Another way to phrase it is that the separate local and slurm executors support a lot more features than the auto executor, so by tying the submitit plugin to the auto executor, hydra can't support these more advanced features.

@omry
Collaborator

omry commented Feb 2, 2021

The AutoExecutor is a part of submitit.
@jrapin ported the Hydra submitit plugin to use the auto executor (It used to allow the user to configure specific Submitit executors).
Please discuss this design choice with him.

@Queuecumber
Contributor Author

Queuecumber commented Feb 2, 2021

Roger that, I'll await his feedback.

@jrapin
Contributor

jrapin commented Feb 3, 2021

because it is hardcoded to want mem_gb, which is a little weird because as far as I know memory settings dont do anything for the local executor.

LocalExecutor is indeed more of a hack to make sure things will work correctly on the actual cluster.

My view on it is that the AutoExecutor is fine for simple things but anything requireing fine-grained control over the executor should call the executors directly.

AutoExecutor supports (or is supposed to support) all the features each executor can provide. So it's intentional that everything should go through it. If there is something you could do with SlurmExecutor and not with AutoExecutor then I would assume some bug (or non-intentional feature :) ).

Submitit already has pretty good validation logic

Actually there is hardly any; most of the logic (especially for mem_per_node/mem_per_gpu) is provided by slurm. As you said, we don't want to reimplement something that is already done somewhere else ;).
The issue with mem_per_gpu is probably that mem_gb used to be the only option (all the per_task and per_gpu variants only appeared recently in slurm), so it was deemed required (like gpus_per_node used to be before we updated the code to allow gpus_per_task). We probably need to update the code to make sure it works either way now that this option exists.
As a workaround you can probably set mem_gb to 0 instead of None, because I think that will deactivate the parameter (didn't try, may be wrong). In any case, for all this to work seamlessly it will need a PR on the submitit side (and maybe one here as well to change the types/add parameters, depending on what changes on the submitit side).

@jrapin
Contributor

jrapin commented Feb 3, 2021

Actually I am partly wrong since it seems you (@Queuecumber) already encoded the logic to make mem_per_gpu work within submitit.
So the issue comes from not being able to override the default values in the hydra submitit plugin config. I still think that setting to 0 will work, but it's a bit confusing, so we should maybe allow to override with None. The most natural way in my opinion would be to ignore keys which are set to None so that they are not passed to submitit.
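
As a rough sketch of that idea (a hypothetical helper, not the actual plugin code), the launcher could simply drop unset keys before handing the parameters to submitit:

from typing import Any, Dict


def drop_unset(launcher_params: Dict[str, Any]) -> Dict[str, Any]:
    # Keep only the parameters the user actually set, so that submitit's own
    # defaults apply to everything else.
    return {key: value for key, value in launcher_params.items() if value is not None}


params = drop_unset({"mem_gb": None, "nodes": 1, "tasks_per_node": 1})
# executor.update_parameters(**params)  # mem_gb is no longer forwarded
print(params)  # {'nodes': 1, 'tasks_per_node': 1}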

@Queuecumber
Contributor Author

Queuecumber commented Feb 3, 2021

So the issue comes from not being able to override the default values in the hydra submitit plugin config. I still think that setting to 0 will work, but it's a bit confusing, so we should maybe allow to override with None

It's a little more than that; the main issue seems to be that AutoExecutor needs mem_gb to be set to an integer. It fails with the error

AssertionError: Parameter "mem_gb" expected type <class 'int'> (but value: "None")

if you allow it to be Optional in the hydra plugin.

If you set it to zero, you again get

submitit.core.utils.FailedJobError: sbatch: fatal: --mem, --mem-per-cpu, and --mem-per-gpu are mutually exclusive.

from slurm, I'm not sure how this works for the CPU settings.

So there's two ways to proceed:

  1. AutoExecutor is unchanged and rebranded as a simple way to get a "lowest common denominator" interface that works well for most cases and can freely switch between cluster and local execution, but doesn't support advanced slurm features. The Hydra submitit plugin is updated to call the correct executor directly, since it should support advanced slurm features.
  2. The Hydra submitit plugin is updated to allow some more optional parameters, and AutoExecutor is updated to allow all advanced slurm features, which may entail adding additional validation (somewhere in https://github.com/facebookincubator/submitit/blob/master/submitit/auto/auto.py#L124), potentially duplicating effort of the slurm executor, or being more lax about what parameters it requires and letting the underlying executor complain if there's an issue.

I think either option seems reasonable; it probably does make sense to be more lax in the auto executor and just let the slurm executor complain if something goes wrong.

@jrapin
Contributor

jrapin commented Feb 3, 2021

Please don't go for 1; this is what we try to avoid within submitit. The interface should be AutoExecutor so that we have a shared interface on as many aspects as possible. If anything is not supported in AutoExecutor while it is by SlurmExecutor, it's a bug, so submitit needs to be updated.

And 2 seems more complicated than needed, I don't see any code in AutoExecutor that prevents us from "not setting" the generic parameters. I'll crosscheck that. And I don't get why setting to 0 does not work, I'll investigate this as well.

@Queuecumber
Contributor Author

And 2 seems more complicated than needed, I don't see any code in AutoExecutor that prevents us from "not setting" the generic parameters.

Because it checks that the value of mem_gb is an integer; see the quoted error in my reply.

I'll crosscheck that. And I don't get why setting to 0 does not work, I'll investigate this as well.

Because it actually passes mem=0 and mem-per-gpu=64GB (or whatever) to sbatch, and you can't set both; that's a hard error in slurm.

@Queuecumber
Contributor Author

Also why is it called mem_gb when slurm calls it mem?

@Queuecumber
Contributor Author

Here's the generated sbatch for reference

#!/bin/bash

# Parameters
#SBATCH --cpus-per-task=1
#SBATCH --error=/private/home/mehrlich/mmenhance/multirun/2021-02-03/06-40-21/.submitit/%j/%j_0_log.err
#SBATCH --job-name=minimal
#SBATCH --mem=0GB
#SBATCH --mem-per-gpu=64GB
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --open-mode=append
#SBATCH --output=/private/home/mehrlich/mmenhance/multirun/2021-02-03/06-40-21/.submitit/%j/%j_0_log.out
#SBATCH --signal=USR1@120
#SBATCH --time=60
#SBATCH --wckey=submitit

# command
export SUBMITIT_EXECUTOR=slurm
srun --output '/private/home/mehrlich/mmenhance/multirun/2021-02-03/06-40-21/.submitit/%j/%j_%t_log.out' --error '/private/home/mehrlich/mmenhance/multirun/2021-02-03/06-40-21/.submitit/%j/%j_%t_log.err' --unbuffered /private/home/mehrlich/.cache/pypoetry/virtualenvs/mme-Rb1sjV-9-py3.6/bin/python -u -m submitit.core._submit '/private/home/mehrlich/mmenhance/multirun/2021-02-03/06-40-21/.submitit/%j'

Note

#SBATCH --mem=0GB
#SBATCH --mem-per-gpu=64GB

@jrapin
Contributor

jrapin commented Feb 3, 2021

Because it checks that the value of mem_gb is an integer, see quoted error in my reply

it checks it if you provide it to the executor, what I mean is you could just not forward None values to the executor in the plugin, because they are always ignored (that's the design we have been using in submitit). Not passing the value would work, I added a test to check it facebookincubator/submitit#1606

Because it actually passes to sbatch mem=0

It ignores 0 values, but the value is passed as "0GB", which is not ignored; I fixed this behavior in the PR facebookincubator/submitit#1606

Also why is it called mem_gb when slurm calls it mem?

It was an attempt to make the naming uniform (and add an explicit unit) from the time we supported several internal clusters (including a non-slurm cluster).

@Queuecumber
Contributor Author

it checks it if you provide it to the executor, what I mean is you could just not forward None values to the executor in the plugin, because they are always ignored (that's the design we have been using in submitit). Not passing the value would work, I added a test to check it facebookincubator/submitit#1606

OK, I see the distinction; let me quickly try adding something to hydra which doesn't pass the None value and see if that fixes it.

@Queuecumber
Contributor Author

Confirmed that fixes it. In that case I think the easiest path forward is to just PR this into the hydra launcher; it's a very small change. I'll verify it works with the cpu options too.

@Queuecumber
Contributor Author

You may not need facebookincubator/submitit#1606 in light of this

@Queuecumber
Contributor Author

@omry While I'm doing this, would you be interested in me updating the SlurmQueueConf to support all of the options that submitit SlurmExecutor now handles? There are quite a few missing from what I remember
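
As an illustration of what that could look like (a sketch only; field names are borrowed from the launcher dump later in this thread, and the real schema may differ), the plugin's config dataclass would mostly grow Optional fields:

from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SlurmQueueConfSketch:
    # Hypothetical subset of SlurmExecutor options, all optional so that
    # unset values can be dropped before reaching submitit.
    mem_gb: Optional[int] = None
    mem_per_gpu: Optional[str] = None
    mem_per_cpu: Optional[str] = None
    cpus_per_task: Optional[int] = None
    cpus_per_gpu: Optional[int] = None
    gpus_per_task: Optional[int] = None
    partition: Optional[str] = None
    additional_parameters: Dict[str, Any] = field(default_factory=dict)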

@Queuecumber
Contributor Author

Confirmed that it all looks like it's working with the other parameters; here's the example config I am using:

defaults:
  - hydra/launcher: submitit_slurm

hydra:
  launcher:
    additional_parameters:
      mem_per_gpu: 64GB
      cpus_per_gpu: 10
      gpus_per_task: 1

and the generated sbatch

#!/bin/bash

# Parameters
#SBATCH --cpus-per-gpu=10
#SBATCH --error=/private/home/mehrlich/mmenhance/multirun/2021-02-03/08-32-53/.submitit/%j/%j_0_log.err
#SBATCH --gpus-per-task=1
#SBATCH --job-name=minimal
#SBATCH --mem-per-gpu=64GB
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --open-mode=append
#SBATCH --output=/private/home/mehrlich/mmenhance/multirun/2021-02-03/08-32-53/.submitit/%j/%j_0_log.out
#SBATCH --signal=USR1@120
#SBATCH --time=60
#SBATCH --wckey=submitit

# command
export SUBMITIT_EXECUTOR=slurm
srun --output '/private/home/mehrlich/mmenhance/multirun/2021-02-03/08-32-53/.submitit/%j/%j_%t_log.out' --error '/private/home/mehrlich/mmenhance/multirun/2021-02-03/08-32-53/.submitit/%j/%j_%t_log.err' --unbuffered /private/home/mehrlich/.cache/pypoetry/virtualenvs/mme-Rb1sjV-9-py3.6/bin/python -u -m submitit.core._submit '/private/home/mehrlich/mmenhance/multirun/2021-02-03/08-32-53/.submitit/%j'

looks perfect

@omry
Collaborator

omry commented Feb 3, 2021

While I'm doing this, would you be interested in me updating the SlurmQueueConf to support all of the options that submitit SlurmExecutor now handles? There are quite a few missing from what I remember

Yup. That would be good.

@sumanabasu

sumanabasu commented Nov 15, 2022

I'm using hydra 1.2.0 with the submitit plugin and the following config, as suggested by @Queuecumber:

defaults:
  - override hydra/launcher: submitit_slurm

hydra:
  launcher:
    mem_per_gpu: 32GB
    cpus_per_gpu: 1
    gpus_per_task: 1

But, I'm still getting the error:
srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.

Any suggestion?

@Queuecumber
Contributor Author

Can you try with an integer for mem_per_gpu? So 32 instead of 32GB? I'm not 100% sure but I think that may be messing it up

@sumanabasu

@Queuecumber Thanks for the reply. No luck even after trying it with 32. :(

@Queuecumber
Contributor Author

Can you post the generated sbatch file? That might help illustrate what's going wrong

@sumanabasu

sumanabasu commented Nov 16, 2022

This is the full config:

defaults:
  - env_params
  - model_data
  - model_training
  - model: decoder_only_transformer
  - optimizer: adam
  - _self_
  - override hydra/launcher: submitit_slurm

root_path : /home/project/
data_path: /home/data/


hydra:
  run:
    dir: ${root_path}expt/${now:%Y-%m-%d %H-%M-%S}
  sweep:
    dir: ${hydra.run.dir}
    subdir: ${hydra.job.num}
  launcher:
    submitit_folder: ${hydra.sweep.dir}/.submitit/%j #${hydra.sweep.dir}/submitit/${hydra.job.num} #${hydra.sweep
    name: ${hydra.job.name}
    _target_: hydra_plugins.hydra_submitit_launcher.submitit_launcher.SlurmLauncher
    mem_per_gpu: 32
    cpus_per_gpu: 1
    gpus_per_task: 1
  sweeper:
    params:
      optimizer.config.init_lr: 0.01, 0.001

And this is the auto-generated hydra launcher section in the multirun.yaml:

launcher:
  submitit_folder: ${hydra.sweep.dir}/.submitit/%j
  timeout_min: 60
  cpus_per_task: null
  gpus_per_node: null
  tasks_per_node: 1
  mem_gb: null
  nodes: 1
  name: ${hydra.job.name}
  stderr_to_stdout: false
  _target_: hydra_plugins.hydra_submitit_launcher.submitit_launcher.SlurmLauncher
  partition: null
  qos: null
  comment: null
  constraint: null
  exclude: null
  gres: null
  cpus_per_gpu: 1
  gpus_per_task: 1
  mem_per_gpu: '32'
  mem_per_cpu: null
  account: null
  signal_delay_s: 120
  max_num_timeout: 0
  additional_parameters: {}
  array_parallelism: 256
  setup: null

@Queuecumber
Contributor Author

What about the sbatch? It should be in the .submitit folder.

@sumanabasu

#!/bin/bash

# Parameters
#SBATCH --array=0-1%2
#SBATCH --cpus-per-gpu=1
#SBATCH --error='/home/project/expt/2022-11-16 15-42-16/.submitit/%A_%a/%A_%a_0_log.err'
#SBATCH --gpus-per-task=1
#SBATCH --job-name=my_job
#SBATCH --mem-per-gpu=32
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --open-mode=append
#SBATCH --output='/home/project/expt/2022-11-16 15-42-16/.submitit/%A_%a/%A_%a_0_log.out'
#SBATCH --signal=USR2@120
#SBATCH --time=60
#SBATCH --wckey=submitit

# command
export SUBMITIT_EXECUTOR=slurm
srun --unbuffered --output '/home/project/expt/2022-11-16 15-42-16/.submitit/%A_%a/%A_%a_%t_log.out' --error '/home/project/expt/2022-11-16 15-42-16/.submitit/%A_%a/%A_%a_%t_log.err' /home/.conda/envs/my_env/bin/python -u -m submitit.core._submit '/home/project/expt/2022-11-16 15-42-16/.submitit/%j'

@Queuecumber
Contributor Author

That sbatch looks correct in that it only sets mem-per-gpu and not the other memory fields; I would contact your slurm admin, because it's possible there's a default mem field being specified by the backend or something.
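
If it helps to rule that out, here is a small diagnostic sketch (mine, not from the thread; it assumes a Slurm login node with scontrol on PATH) that prints any cluster-wide default memory settings:

import subprocess

# Print DefMemPer* lines from the cluster configuration; a non-zero default
# (e.g. DefMemPerCPU) can clash with --mem-per-gpu at submission time.
config = subprocess.run(
    ["scontrol", "show", "config"], capture_output=True, text=True, check=True
).stdout
for line in config.splitlines():
    if "DefMemPer" in line:
        print(line.strip())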

@timruhkopf

I am facing the same issue. Was this successfully resolved @sumanabasu?

My admin also confirmed that this should actually be running (and replacing the srun command with a plain srun python sleep.py call verified that at least the sbatch combination runs successfully).

@sumanabasu

sumanabasu commented Mar 1, 2023

@timruhkopf After leaving it aside for many months I just got it working this morning (after spending many hours yesterday!)!

Setting the mem_per_gpu flag didn't work for me, even though my admin confirmed the mem flag does not appear to be set on the backend. There is clearly a clash somewhere, but we couldn't identify where. As a workaround, we skip the mem_per_gpu flag altogether, only specify the mem_per_cpu flag, and request GPUs that we know have enough memory to serve my purpose.

My hydra launcher at this moment looks like this:

launcher:
    submitit_folder: ${hydra.sweep.dir}
    name: ${hydra.job.name}
    _target_: hydra_plugins.hydra_submitit_launcher.submitit_launcher.SlurmLauncher
    additional_parameters:
      mem_per_cpu: 32GB
      cpus_per_gpu: 1
      gpus_per_task: 3g.39gb:1
      time: 1-23:00:00

This is what my hydra-generated sbatch looks like:

#!/bin/bash

# Parameters
#SBATCH --array=0-8%9
#SBATCH --cpus-per-gpu=1
#SBATCH --error=/home/expt/20230301T130218%A_%a.err
#SBATCH --gpus-per-task=3g.39gb:1
#SBATCH --job-name=myjob
#SBATCH --mem-per-cpu=32GB
#SBATCH --open-mode=append
#SBATCH --output=/home/expt/20230301T130218%A_%a.out
#SBATCH --signal=USR2@120
#SBATCH --time=1-23:00:00
#SBATCH --wckey=submitit

# command
export SUBMITIT_EXECUTOR=slurm
srun --unbuffered --output /home/expt/20230301T130218/%A_%a_%t_log.out --error /home/expt/20230301T130218/%A_%a_%t_log.err /home/.conda/envs/venv/bin/python -u -m submitit.core._submit /home/expt/20230301T130218

Good Luck!
