
Allow scheduler options import from configuration file #405

Merged
merged 18 commits into from May 12, 2020

Conversation

kathoef
Member

@kathoef kathoef commented Apr 2, 2020

This is a possible implementation of #400. For a start, I have decided to go for as little extra code as possible and have added a config_scheduler_options variable that is set by a cluster-specific nested scheduler_options entry (as was suggested by @lesteve above).

However, in terms of consistency/design choices, I am not entirely sure about the scheduler_options entry being fully optional. The advantage is probably that everybody who doesn't need this feature won't have to extend their already existing user-specific configuration files. The disadvantage is certainly that configurable options are now handled inconsistently. If I see it correctly, every parameter that can be set from a configuration file is currently required to exist as an entry, and I think it would be user-friendly to follow this logic for the scheduler options as well (which means, however, extending jobqueue.yaml by even more redundant lines, which I am a bit hesitant to do).

A general way out could be to implement configurable options not as currently done by

cores = dask.config.get("jobqueue.%s.cores" % self.config_name)

which raises a KeyError and terminates execution if an entry does not exist, but by

cores = dask.config.get("jobqueue.%s" % config_name).get('cores')

which returns None if an entry does not exist. If I haven't overlooked anything, the overall behaviour of the code would not change, because a required configuration entry cores: null is also simply translated into cores = None. Implementing configurable options like this would save a lot of lines and allow setting up very minimalistic configuration files.
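To illustrate the difference, a minimal sketch (the "jobqueue.pbs.queue" entry below is just a stand-in so that the section exists; only dask itself is needed):

import dask

# Hypothetical config that defines a "jobqueue.pbs" section but no "cores" entry.
with dask.config.set({"jobqueue.pbs.queue": "regular"}):
    # The dotted lookup without a default raises KeyError for the missing entry ...
    try:
        dask.config.get("jobqueue.pbs.cores")
    except KeyError:
        print("no 'cores' entry configured")

    # ... whereas fetching the section and chaining .get() simply yields None.
    cores = dask.config.get("jobqueue.pbs").get("cores")
    assert cores is None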

On the other hand, it would certainly become less clear what is actually configurable. Maybe keeping it inconsistent, but properly documenting it, is the way to go… or keeping it minimalistic and properly documenting the available options?

Either way, input is greatly appreciated...

Member

@lesteve lesteve left a comment

Thanks a lot for the PR! A few quick remarks:

  • you could add an empty entry in jobqueue.yaml. I guess adding scheduler_options: would be enough. It is a bit annoying for people updating dask-jobqueue because it may break their config. Because of this I would be in favour of doing .get for scheduler_options and live with the inconsistency ...
  • tests for this would be very nice to have because it's easy to get it wrong. I believe we already have some similar tests; look for dask.config.set in the tests.

# scheduler_options overrides parameters common to both workers and scheduler
scheduler_options = dict(default_scheduler_options, **scheduler_options)

config_scheduler_options = dask.config.get("jobqueue.%s" % config_name).get('scheduler_options') or {}
Member

I am not 100% sure but I would do something like this:

if scheduler_options is None:
    scheduler_options = dask.config.get("jobqueue.%s" % config_name, {})
    
scheduler_options = dict(default_scheduler_options, **scheduler_options)

What I am aiming at here is that the scheduler_options parameter completely overrides the config scheduler_options, rather than trying to combine both and have some keys set from the config and some keys set from the parameter.

To me it feels simpler and less error-prone this way.
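For illustration, a tiny sketch of the two behaviours being weighed here (the option names are invented):

config_value = {"interface": "ib0", "port": 8786}  # e.g. from the config file
passed_value = {"port": 0}                          # passed as a parameter

# Merging keeps config keys that the parameter does not mention ...
merged = dict(config_value, **passed_value)
assert merged == {"interface": "ib0", "port": 0}

# ... whereas a full override ignores the config once a parameter is given.
effective = passed_value if passed_value is not None else config_value
assert effective == {"port": 0}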

Member Author

What I am aiming at here is that the scheduler_options parameter completely overrides the config scheduler_options, rather than trying to combine both and have some keys set from the config and some keys set from the parameter.

Thanks for putting that up! Actually, I had implemented it that way very intentionally, because I like to have things flexible. However, I just had a look at how Dask-Jobqueue handles this for other, i.e. the (variable-length) list-type, parameters. No merging happens there, so I have decided to implement your suggestion to keep this consistent.

@kathoef
Member Author

kathoef commented Apr 7, 2020

Thanks also for the helpful entry points on testing! I have come up with two scenarios that might be useful: one checks that parameters provided in the config are actually used (to specify the scheduler options), the other checks that manually passed parameters completely overwrite those from the config.

(In the earlier version I also had tests that checked for the default behaviour, as well as passing the host options. However, I think that is already covered in the existing tests and does not really add value here, so I deleted them. Of course, the remaining tests should be moved into test_jobqueue_core.py, but I did not do that for now.)
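For reference, a rough sketch of the config-injection pattern (not the actual test code; the "pbs" section name and option values are placeholders):

import dask

scheduler_options = {"interface": "eth0", "port": 8786}

# Inject a scheduler-options entry the way the existing tests set up configuration ...
with dask.config.set({"jobqueue.pbs.scheduler-options": scheduler_options}):
    # ... and read it back the way the cluster constructor would when no
    # scheduler_options argument is passed explicitly.
    from_config = dask.config.get("jobqueue.pbs.scheduler-options", {})
    assert from_config == scheduler_options

# Outside the context manager the entry is gone again and the empty default
# applies (assuming no other jobqueue configuration is loaded).
assert dask.config.get("jobqueue.pbs.scheduler-options", {}) == {}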

Overall, I am not entirely sure if my generic_cluster_conf approach is good (thinking long-term here). It was the shortest implementation I could think of, but each time an option is added to one of the cluster types, it would also have to be added in that generic_cluster_conf variable. (I was also thinking about instead adding a scheduler_options test in each of the cluster type collections, to keep things together, but that felt like writing a lot of redundant code...)

you could add an empty entry in jobqueue.yaml. I guess adding scheduler_options: would be enough.

Good idea, that would at least indicate that a per-file configuration of the scheduler_options is possible. Should I add this in addition to having these parameters fully optional?

Member

@lesteve lesteve left a comment

code + test looks good at a first glance, thanks!

I am wondering about 2 things:

  • whether we should put an empty scheduler_options: section for all the schedulers in jobqueue.yaml. Up for debate.
  • whether there should be a test to read from a file with a non-trivial scheduler_options in the config file. This is to make sure this works. I think it should but having a test would be better because I am not 100% confident

@kathoef
Member Author

kathoef commented Apr 11, 2020

whether we should put an empty scheduler_options: section for all the schedulers in jobqueue.yaml. Up for debate.

I think there are more pros than cons here, so I have implemented it now. (In the config file, I've also taken the liberty to move the extra: [] entries from the resource manager into the Dask worker sections for each cluster type. This mis-categorization had bothered me some time ago already, and just attracted my attention again. If you think this is more than a minor fix and should be a separate PR, please tell me, and I will undo it.)

whether there should be a test to read from a file with a non-trivial scheduler_options in the config file. This is to make sure this works. I think it should but having a test would be better because I am not 100% confident

I have started to implement this, but then I thought that this is actually a topic of general relevance. If I see it correctly, all the config file options are currently tested only by manually defining a dictionary and adding it to the global dask.config pile; reading from a file is never involved. It might be good to have such tests for more than just the scheduler_options, which might, however, be better tackled in a separate PR?
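For what it's worth, a sketch of how such a file-based test could look (not part of this PR; the file contents are illustrative, and the file is loaded manually rather than through Dask's own config-file discovery):

import tempfile

import dask
import yaml

config_text = """
jobqueue:
  pbs:
    scheduler-options:
      interface: eth0
      dashboard_address: ":8787"
"""

# Write a user-style configuration file ...
with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
    f.write(config_text)
    path = f.name

# ... read it back, and expose its contents to dask.config.
with open(path) as f:
    file_config = yaml.safe_load(f)

with dask.config.set(file_config):
    options = dask.config.get("jobqueue.pbs.scheduler-options", {})
    assert options["interface"] == "eth0"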

@kathoef
Member Author

kathoef commented Apr 15, 2020

Thanks for repairing the CI, seems to have worked now!
(I hope that simply merging these changes was the way to go.)

@kathoef kathoef requested a review from lesteve April 15, 2020 11:18
@lesteve
Member

lesteve commented Apr 15, 2020

@kathoef sorry, I am quite busy with other things right now ... can you remove the "Draft" status of this PR? (Not sure how to do this, I have to admit.)

I started using GitHub before this Draft status existed; out of curiosity, what is the advantage? I have always seen things like WIP in the title, which worked well enough ...

I'll try to review this PR before next Monday, don't hesitate to ping me if I don't do that!

@kathoef
Member Author

kathoef commented Apr 15, 2020

@kathoef sorry, I am quite busy with other things right now ... can you remove the "Draft" status of this PR? (Not sure how to do this, I have to admit.)

Thanks for taking the time to leave a quick note! Yes, I can remove the draft status. (I have a "ready to review" button just above the "All checks have passed" part, maybe only I can see this?)

I started using GitHub before this Draft status existed; out of curiosity, what is the advantage? I have always seen things like WIP in the title, which worked well enough ...

That's a very good question. I am relatively new to collaborative code work and to me this seemed like the clean way to go (when I opened this PR, I was sure that the code wasn't ready to be merged). Maybe it's actually a formal mechanism to ensure an agreement between developers and reviewers about when to merge? If a reviewer can't change the PR status, then this would be my interpretation. (In addition to that, unfortunately, I have absolutely no idea about how GitHub interprets this WIP title syntax. That's probably also why I have chosen the draft PR feature; it felt less like "doing it wrong".)

I'll try to review this PR before next Monday, don't hesitate to ping me if I don't do that!

No urgency, and yes I will, thank you!

@kathoef kathoef marked this pull request as ready for review April 15, 2020 17:09
Member

@lesteve lesteve left a comment

Looks good, I have a few small comments.

BTW I forgot to review your PR before Monday as I promised, you should have pinged me ;-)

dask_jobqueue/core.py
) as cluster:
    scheduler_options = cluster.scheduler_spec["options"]
    assert scheduler_options.get("interface") == pass_scheduler_interface
    assert scheduler_options.get("port") is None
Member

Maybe add a comment before this line to explain that if scheduler_options is passed as an argument it completely replaces the scheduler_options config value rather than overriding the keys.

Member Author

I think it could be useful to add a comment about this. But shouldn't this also be added for the list-type entries/arguments that get fully overwritten? As the tests mostly don't have any comments, I am also wondering if this is a good / the best place to document this behaviour. (Maybe one could also move these two test cases into separate functions and give them more meaningful names? But, in my impression, this would break with the current style.)


pass_scheduler_interface = list(net_if_addrs.keys())[1]

generic_cluster_conf = {
Member

Do you need generic_cluster_conf i.e. can you not pass a smaller dict without these keys set to their default value?

Member Author

Thanks for questioning this approach, I now came up with a shorter implementation. I guess, though, the drawback is the very "hacky" retrieval of the cluster's config_name. Any idea on how to improve that?

While thinking about where to retrieve the config_name variable, I also stumbled across an inconsistency in the cluster implementations' config_name variables. As it would be good practice to do so, I have opened a separate PR on it. (I should probably also have done that for moving those extra: [] entries in the configuration...)

Member Author

Thanks for questioning this approach, I now came up with a shorter implementation. I guess, though, the drawback is the very "hacky" retrieval of the cluster's config_name. Any idea on how to improve that?

Well, I have just solved that (stupid) "hacky" part... still open for suggestions on the present approach, though. ;)

kathoef and others added 2 commits April 30, 2020 18:40
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
@kathoef
Member Author

kathoef commented Apr 30, 2020

BTW I forgot to review your PR before Monday as I promised, you should have pinged me ;-)

Yes, sorry... I had some medical intervention on Tuesday last week, and I guess this made several things slip off my table already on Monday... thanks for taking care.

@kathoef kathoef requested a review from lesteve April 30, 2020 17:58
@kathoef
Member Author

kathoef commented May 4, 2020

Looks like the CI also failed because I was too sloppy... I have now restored the empty dictionary that allows the scheduler-options entry to be fully optional in the configurations (as scheduler-options is missing from the individual cluster class configuration tests, these had failed).

I originally planned to move the tests into test_jobqueue_core.py. However, I have come to think that it might be a good idea to keep them separate (that's up for debate, however!), so I have tried to choose a more useful and less specific file name.

As the CI now has passed, could you have another look, @lesteve?

@@ -477,7 +477,9 @@ def __init__(
         if interface is None:
             interface = dask.config.get("jobqueue.%s.interface" % config_name)
         if scheduler_options is None:
-            scheduler_options = {}
+            scheduler_options = dask.config.get(
+                "jobqueue.%s.scheduler-options" % config_name, {}
Member

Can you explain to me why the {} is needed? I would have thought that since scheduler-options is in jobqueue.yaml it should not be needed, but the CI was failing without the {}, so I must be missing something ...

Member Author

The "problem" is that the configuration dictionaries used for testing are all set up explicitly, and thus won't inherit from jobqueue.yaml; see here for an example. (Copying that approach is actually how I came up with this generic_cluster_conf dictionary.)

Instead of having

dask.config.get("jobqueue.%s.scheduler-options" % config_name, {})

it would also be possible to add scheduler-options: {} to each of these configuration dictionaries used for testing the specific clusters.

However, then we would actually require the existence of a scheduler-options entry for standalone configurations, which would break already existing user configurations.
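To make that concrete, a small sketch (the "test-config" section name and values below are invented):

import dask

# A test-style config dict that, like the existing per-cluster test fixtures,
# defines its entries explicitly and has no "scheduler-options" key.
minimal_conf = {"cores": 8, "memory": "24GB", "queue": "regular"}

with dask.config.set({"jobqueue.test-config": minimal_conf}):
    # Without a default, this lookup would raise KeyError:
    #     dask.config.get("jobqueue.test-config.scheduler-options")
    # with the {} default it degrades gracefully instead.
    options = dask.config.get("jobqueue.test-config.scheduler-options", {})
    assert options == {}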

Member

However, then we would actually require the existence of a scheduler-options entry for standalone configurations, which would break already existing user configurations.

OK, fair enough; to avoid breaking backward compatibility, let's keep the .get.

@lesteve
Member

lesteve commented May 12, 2020

Sorry for taking such a long time to merge this one :-S. I don't have that much time to look at Dask-Jobqueue these days unfortunately ...

FYI I pushed a few tweaks:

  • changelog entry
  • move the tests back to test_jobqueue_core.py. In my mind they are a catch-all for tests that should work for all cluster classes. I agree it may need to be split at one point in the future but we'll cross that bridge when we get there
  • Also, I have an ongoing PR to make common tests like this one easier to write in WIP: Add unified test suite infrastructure #353. Leaving the tests in test_jobqueue_core.py makes it more likely that I manage to remember to take them into account.

@lpsinger
Contributor

When will this patch be in a release?

@lesteve
Member

lesteve commented Jul 27, 2020

Not sure, maybe in one or two weeks if I get the chance to do a release by then, but there is no guarantee about this unfortunately.

In the meantime, two work-arounds:

  • you can use the scheduler_options parameter, i.e. not go through the config file (see the sketch after this list)
  • you can install the dask-jobqueue development version (for example pip install git+https://github.com/dask/dask-jobqueue, or update your requirements.txt or environment.yml in a similar fashion)
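A sketch of the first work-around (the cluster class and option values are only examples; adapt them to your site):

# Work-around 1: pass scheduler_options directly instead of reading it from the config file.
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=4,
    memory="16 GB",
    scheduler_options={"dashboard_address": ":8787"},
)

# Work-around 2: install the development version, e.g.
#   pip install git+https://github.com/dask/dask-jobqueue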

@andersy005
Member

Not sure, maybe in one or two weeks if I get the chance to do a release by then, but there is no guarantee about this unfortunately.

@lesteve, what are the pending tasks/PRs that need to be addressed prior to the next release? I'm happy to help out.

@lesteve
Member

lesteve commented Jul 28, 2020

@lesteve, what are the pending tasks/PRs that need to be addressed prior to the next release? I'm happy to help out.

Thanks a lot for your help! I had a quick look at the open issues and I think I would need to figure out what to do about #196. I am leaning towards providing a work-around at the moment rather than changing the dask-jobqueue code so this would not be a blocker.

Feel free to have a quick look at the list of issues to see if I have missed something!
