Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Submitit] Restructure the plugin configuration #682

Merged
merged 18 commits into from
Jun 22, 2020

Conversation

jrapin
Copy link
Contributor

@jrapin jrapin commented Jun 15, 2020

Motivation

Only use AutoExecutor for submitit plugin, since it is the recommended method

Todo

  • clarify parameter names (timeout_min -> timeout_minute(s)? gpus_per_node -> gpus_per_task (forward compatibility). TBD
  • clean DictConf instantiation in init
  • can we get rid of executor parameter (local/slurm)?
  • manual test on slurm
  • update documentation

Have you read the Contributing Guidelines on pull requests?

Yes

Test Plan

To do

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 15, 2020
Copy link
Contributor Author

@jrapin jrapin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR does not work at all for now, it just states would should be the configs in my opinion
Note that it's actually shorter, simpler and actually provides more options.
We'll discuss names later, some may require renaming

@omry
Copy link
Collaborator

omry commented Jun 16, 2020

ask for review when ready.

@jrapin
Copy link
Contributor Author

jrapin commented Jun 16, 2020

ask for review when ready.

I wouldnt call it ready but I need help sorting this out. I couldn't make the config group work.

First thing to deal with in my opinion is what the structure of the conf should look like:
for now its:

folder
queue
params:
  gpus_per_task
  etc

I tried to change it to something like the following, removing one level

folder
queue
gpus_per_task
etc

but that makes the init of the plugin a bit more complicated.

When that is dealt with, the second question will of course be "how to change from local to slurm"

@omry
Copy link
Collaborator

omry commented Jun 16, 2020

The plugin is instantiated using hydra.util.instantiate.
The structure it expects is ObjectConf.
This is where the first params is coming from. This is not flexible.

There is an open rfc to eliminate that and support recursive instantiation (#566).
For now, keep the first params.

Try something like this (note that I renamed *Conf to *Params to reflect that I want to use them as the params object.

@dataclass
class BaseParams:
    # Where does this belong? maybe in base? 
    # if not move it to one of the the two specific configs
    folder: str = "${hydra.sweep.dir}/.${hydra.launcher.params.executor}"
    ...

@dataclass
class SlurmParams(BaseParams):
   ...

@dataclass
class LocalParams(BaseParams):
    ...


@dataclass
class SlurmConf(ObjectConf):
  cls: ...
  params: SlurmParams = SlurmParams()

@dataclass
class LocalConf(ObjectConf):
  cls: ...
  params: LocalParams = LocalParams()

# finally, register two different choices:
ConfigStore.instance().store(
    group="hydra/launcher",
    name="submitit_local",
    node=LocalConf,
    provider="submitit_launcher",
)

ConfigStore.instance().store(
    group="hydra/launcher",
    name="submitit_slurm",
    node=SlurmConf,
    provider="submitit_launcher",
)

There is a similar example here (without the ObjectConf).

@omry
Copy link
Collaborator

omry commented Jun 16, 2020

An alternative style which might be easier to read is this:
(Not sure which one I prefer yet).
I do this in Hydra's config when it's not getting too deep.

@dataclass
class BaseParams:
  ...

@dataclass
class SlurmConf(ObjectConf):
  cls :str = "hydra_plugins.....SubmititLauncher"

  @dataclass
  class SlurmParams(BaseParams):
     ...

  params: SlurmParams = SlurmParams()

@dataclass
class LocalConf(ObjectConf):
  cls :str = "hydra_plugins.....SubmititLauncher"
  @dataclass
  class LocalParams(BaseParams):
      ...

  params: LocalParams = LocalParams()

Copy link
Contributor Author

@jrapin jrapin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the help.
It's now working locally. There are a few questions left before I move to test on manually in slurm (see comments)

# cluster to use (currently either "slurm" or "local" are supported,
# None defaults to an available cluster)
cluster: str = "slurm"
executor: ExecutorName = ExecutorName.slurm # can we get rid of this?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed here? Is there another way to get this information since we now call submitit_local or submitit_slurm. That would prevent a user from setting it to the wrong value.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now you are using this as the flag to gate slurm or local initialization logic.
An alternative I think will be much cleaner is to use a different class. I will comment below.

Comment on lines 39 to 43
self.params = OmegaConf.structured(
LocalParams if params["executor"].value == "local" else SlurmParams
)
for key, val in params.items():
OmegaConf.update(self.params, key, val if key != "executor" else val.value)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can this be simplified?

}
)
init_keys = specific_init_keys | set(init_renamer) # used config keys
executor = submitit.AutoExecutor(**init_params)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now everything goes through AutoExecutor, whatever the choice of the executor. This makes it more maintainable.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you can tell, it takes some trickery to prepare the correct dictionary for the AutoExecutor.
The reason is that the input config is no longer lining up with AutoExecutor parameters.

If you used SlurmExecutor and LocalExecutor directly, with corresponding SubmititSlurmLauncher and SubmititLocalLauncher as I suggested above I think you would be able to more easily match the init signature of each to what you want to pass the respective submitit launcher.

You can still use auto, but at least in this case the logic in each of the launchers will need to initialize a known mode which will simplify things

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you used SlurmExecutor and LocalExecutor directly, with corresponding SubmititSlurmLauncher and SubmititLocalLauncher as I suggested above I think you would be able to more easily match the init signature of each to what you want to pass the respective submitit launcher.

I won't call SlurmExecutor and LocalExecutor directly, that would make the plugin harder to maintain. The alignment will actually be much simpler to maintain. Also, there's a PR on its way in submitit that would make it even easier.

Copy link
Collaborator

@omry omry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this more.
see my comments, I think we can simplify more by splitting into two different launcher classes.
they can inherit from your current launcher and have custom __init__() methods.

# cluster to use (currently either "slurm" or "local" are supported,
# None defaults to an available cluster)
cluster: str = "slurm"
executor: ExecutorName = ExecutorName.slurm # can we get rid of this?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now you are using this as the flag to gate slurm or local initialization logic.
An alternative I think will be much cleaner is to use a different class. I will comment below.



@dataclass
class SubmititLauncherConf(ObjectConf):
class LocalConf(ObjectConf):
cls: str = "hydra_plugins.hydra_submitit_launcher.submitit_launcher.SubmititLauncher"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can define a launcher class for each mode and have and have the custom logic in it.
They can share a superclass or otherwise reuse most of the logic through other means.

  cls: str = "hydra_plugins.hydra_submitit_launcher.submitit_launcher.SubmititLocalLauncher"

This will also allow you to named parameters in the __init__ of the class:

class SubmititLauncher(Launcher):
    def __init__(self, 
      partition: Optional[str],
      comment: Optional[str],
      constraint: Optional[str],
      exclude: Optional[str],
      ...
) -> None:
        # forward directly to the implementation
        # (my preference would be to SlurmExecutor and LocalExecutor at this point).
        ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not double down the name of each parameter both in the config and in the init. Naming them here would just add code for nothing in my opinion (we don't need type hints here since it's never called directly by a user, and all the checks can be efficiently performed through the config). So is there a shortcut to initialize the omega conf from a dictionary? I couldnt find any without requiring a for loop

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You asked how to simplify. this is my advice.
I don't think you will need to double much:
between config inheritance, and passing the actually needed parameters into each init there will be no duplication in my opinion.

regardless, it's your call.
Changing submitit can also help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but is there a better way to build the OmegaConf structure from a dict. Currently I am doing:

        self.params = OmegaConf.structured(LocalParams)
        for key, val in params.items():
            OmegaConf.update(self.params, key, val)

bur I would think I am missing something (and it would not work with Enum, but fortunately there is no more Enum now).
Otherwise I'll use the Config directly but I prefer the flexibility of OmegaConf

Copy link
Collaborator

@omry omry Jun 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try :

self.params = OmegaConf.structured(LocalParams(**params))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general answer to:

but is there a better way to build the OmegaConf structure from a dict

is:

OmegaConf.create({})

In this example you want to get the type safety from the dataclass.
Here is some context:
https://omegaconf.readthedocs.io/en/latest/structured_config.html#merging-with-other-configs

You can also try what I suggested, which is possibly a better match here.

}
)
init_keys = specific_init_keys | set(init_renamer) # used config keys
executor = submitit.AutoExecutor(**init_params)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you can tell, it takes some trickery to prepare the correct dictionary for the AutoExecutor.
The reason is that the input config is no longer lining up with AutoExecutor parameters.

If you used SlurmExecutor and LocalExecutor directly, with corresponding SubmititSlurmLauncher and SubmititLocalLauncher as I suggested above I think you would be able to more easily match the init signature of each to what you want to pass the respective submitit launcher.

You can still use auto, but at least in this case the logic in each of the launchers will need to initialize a known mode which will simplify things


# specify resources/parameters
baseparams = set(dataclasses.asdict(BaseParams()).keys())
print(baseparams)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't forget the prints.

Comment on lines 42 to 44
self.params = OmegaConf.structured(
LocalParams if params["executor"].value == "local" else SlurmParams
LocalParams if self._EXECUTOR == "local" else SlurmParams
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest adding a base class to express that the slurm launcher is not REALLY a local launcher.

# your current local:
class BaseSubmititLauncher(Launcher):
  def _get_params_class() -> Type[BaseParams]:
     raise NotImplementedError

  ... # the rest is the same.

class LocalSubmititLauncher(BaseSubmititLauncher)
  def _get_params_class() -> Type[BaseParams]:
     return LocalParams 

class SlurmSubmititLauncher(BaseSubmititLauncher)
  def _get_params_class() -> Type[BaseParams]:
     return SlurmParams

You can go back to a DictConfig with something like this (and now you don't need the update loop)

self.params = OmegaConf.structured(self._get_params_class(**params))

Comment on lines 45 to 46
for key, val in params.items():
OmegaConf.update(self.params, key, val if key != "executor" else val.value)
OmegaConf.update(self.params, key, val)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am surprised that worked in the previous version when you use the actual object and not a DictConfig.

regardless, you can get rid of this loop with the example I gave you above.

@jrapin jrapin changed the title [WIP] [Submitit] Use only auto executor [Submitit] Use only auto executor Jun 19, 2020
@jrapin
Copy link
Contributor Author

jrapin commented Jun 19, 2020

@omry it's now manually tested and working on SLURM cluster.
In my opinion it seems ready, let me know if it requires more updates.

# cluster to use (currently either "slurm" or "local" are supported,
# None defaults to an available cluster)
cluster: str = "slurm"
submitit_folder: str = "${hydra.sweep.dir}/.submitit/%j"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's talk about this for a moment:
With Hydra, each job is getting a working directory inside the sweep dir.

If we can keep all the files for the job in that directory it would be really good (users will have an easier time debugging issues for a particular job).
This may take some submitit changes, but is something to consider for the future.

Right now, if a job is failing the user will need to:

  1. Look at the log file in the job output directory.
  2. Look in the submitit output folder, figure out which files belong to the job that caused the issue somehow and check the output/error files.

If we managed to use the working directory here it would be better.
I suspect currently this is not possible without changes to submitit.

An alternative solution to this problem is to symlink the stdout and stderr from the submitit folder into the job working directory. But this can also be tricky.

This is not a feature request for this PR, but something to think about to improve the user experience.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can keep all the files for the job in that directory it would be really good

Indeed (for now this is the same behavior as before, but I agree it is not convenient)

I suspect currently this is not possible without changes to submitit.

It would actually be possible if not using arrays, but that's not a good option. With arrays on the other hand we can't specify the exact folder of each run individually, it must be deduced from the parameters of the array. Hacking something is somehow possible but would require a lot of (mostly unhelpful) changes in submitit and would not be robust anyway.

The other option would be that hydra uses the folder provided by submitit (using the job_id instead of 0, 1, 2 etc). But that's probably not something you'd want either, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user can customize the sweep subdirectory through Hydra.
So yeah, I can't really use the submitit generated dir.

}
)
init_keys = specific_init_keys | {"submitit_folder"}
executor = submitit.AutoExecutor(cluster=self._EXECUTOR, **init_params)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line is confusing to me:

The auto executor is expecting the strings "local" or "slurm" as options for cluster?

```

Note that this plugin expects a valid environment in the target host. usually this means a shared file system between
the launching host and the target host.

Submitit supports 3 types of queues: auto, local and slurm. Its config looks like this
Submitit actually implements 2 different launchers: `submitit_slurm` to run on a SLURM cluster, and `local` to test locally.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about the following?

  1. Include links to config.py with dataclasses instead of copying them in.
  2. include the outputs of
$ python foo.py hydra/launcher=submitit_slurm --cfg hydra -p hydra.launcher
$ python foo.py hydra/launcher=submitit_local-cfg hydra -p hydra.launcher

When describing each of the two launchers.
I am still trying to discover what is the best way to document plugins with structured configs, and I think I like the above proposal.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I saw that you already added similar output.
Also since the primary use case is SLURM we can focus on it.

This will output the current slurm parameters:

python foo.py hydra/launcher=submitit_slurm --cfg hydra -p hydra.launcher

and I recommend that instead of inlining the dataclasses we can just give a link.

Also, an example of how to override the launcher parameters from the command line would be helpful:

python foo.py hydra/launcher=submitit_slurm hydra.launcher.params.partition=prod

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I saw that you already added similar output.

Yes I find them the most interesting since they make it easier to discover settings than just stating the config classes.

and I recommend that instead of inlining the dataclasses we can just give a link.

I have shared feelings about this, since the interest of adding the config files there came from the description of each parameters. Then again, since it often ends up being outdated if not careful, it's probably better to just have the link.

So I just:

  • removed the config classes
  • added a link the the config classes
  • added the discovery of submitit_local parameters
  • added a commandline override example

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point I will support field description in the help or --cfg. when that happens we will be able a better description here.

Copy link
Collaborator

@omry omry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. please respond to my suggestion to change the docs.

```

Note that this plugin expects a valid environment in the target host. usually this means a shared file system between
the launching host and the target host.

Submitit supports 3 types of queues: auto, local and slurm. Its config looks like this
Submitit actually implements 2 different launchers: `submitit_slurm` to run on a SLURM cluster, and `local` to test locally.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I saw that you already added similar output.
Also since the primary use case is SLURM we can focus on it.

This will output the current slurm parameters:

python foo.py hydra/launcher=submitit_slurm --cfg hydra -p hydra.launcher

and I recommend that instead of inlining the dataclasses we can just give a link.

Also, an example of how to override the launcher parameters from the command line would be helpful:

python foo.py hydra/launcher=submitit_slurm hydra.launcher.params.partition=prod

@omry omry changed the title [Submitit] Use only auto executor [Submitit] Restructure the plugin configuration Jun 19, 2020
jrapin and others added 2 commits June 22, 2020 11:40
…auncher/submitit_launcher.py

Co-authored-by: Omry Yadan <omry@fb.com>
@jrapin
Copy link
Contributor Author

jrapin commented Jun 22, 2020

ci/circleci: py36_win seems broken :s

Copy link
Collaborator

@omry omry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reran the windows thing, probably transient infra issue.

# cluster to use (currently either "slurm" or "local" are supported,
# None defaults to an available cluster)
cluster: str = "slurm"
submitit_folder: str = "${hydra.sweep.dir}/.submitit/%j"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user can customize the sweep subdirectory through Hydra.
So yeah, I can't really use the submitit generated dir.

Comment on lines +78 to +84
# finally, register two different choices:
ConfigStore.instance().store(
group="hydra/launcher",
name="submitit_local",
node=LocalConf,
provider="submitit_launcher",
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can lose LocalConf and SlurmConf class definitions without losing functionality or clarity.

ConfigStore.instance().store(..., node=ObjectConf(cls: str ="XYZClass", params = XYZParams()))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have done something wrong but I get:
omegaconf.errors.UnsupportedValueType: Value 'LocalParams' is not a supported primitive type

Copy link
Collaborator

@omry omry Jun 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is LocalParams a dataclass?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its a BaseParams, which is a dataclass, does it need a decorator as well? I'm not really familiar with dataclasses

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to create a minimal example showing the problem with OmegaConf directly.
this seems to work and is very close to what you are doing.

In [1]: from dataclasses import dataclass
In [9]: from typing import Any

In [2]: from omegaconf import OmegaConf

In [3]: @dataclass
   ...: class Foo:
   ...:     num : int = 10
   ...: 

In [4]: @dataclass
   ...: class LocalFoo(Foo):
   ...:     name: str = "name"
   ...: 

In [5]: OmegaConf.structured(LocalFoo)
Out[5]: {'num': 10, 'name': 'name'}

In [6]: OmegaConf.structured(LocalFoo())
Out[6]: {'num': 10, 'name': 'name'}

In [7]: OmegaConf.structured({"add-hoc":LocalFoo()})
Out[7]: {'add-hoc': {'num': 10, 'name': 'name'}}



In [10]: @dataclass
    ...: class ObjectConf:
    ...:     cls: str = "abc"
    ...:     params: Any = None
    ...: 

In [11]: OmegaConf.structured(ObjectConf(cls="xyz", params=LocalFoo()))
Out[11]: {'cls': 'xyz', 'params': {'num': 10, 'name': 'name'}}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@omry
Copy link
Collaborator

omry commented Jun 22, 2020

I increased the no-output timeout for windows. can you rebase and try again?
also see my comment about simplifying config.py a bit.

@omry
Copy link
Collaborator

omry commented Jun 22, 2020

going to land it, I am having some test failures with the existing submitit plugin after making some core changes and it will be best to fix them on top of your changes.
I will try to simplify the config as I suggested.

@omry omry merged commit 556bc93 into facebookresearch:master Jun 22, 2020
@omry
Copy link
Collaborator

omry commented Jun 22, 2020

One thing I noticed that is missing is a news fragment. especially for breaking changes like this one we should add them.
I will do it for this one.

@jrapin jrapin deleted the only_auto branch June 23, 2020 13:18
@jrapin
Copy link
Contributor Author

jrapin commented Jun 23, 2020

One thing I noticed that is missing is a news fragment. especially for breaking changes like this one we should add them.
I will do it for this one.

Thanks for this and for merging ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants