Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix name parameters being ignored in Cluster class #398

Merged
merged 14 commits into from
May 29, 2020

Conversation

lesteve
Copy link
Member

@lesteve lesteve commented Mar 24, 2020

Fix #386 and a few other quirks I found along the way. This may be split in separate PRs, for example depending on what happens with #397 (Closes #397).

At the moment the main thing missing is:

  • implement validate_job_kwargs to make sure that the job kwargs exist either in JobQueueCluster.__init__ (base class) or self.job_cls.__init__ (derived class) and have a user-friendly error at the cluster-level. This may not work as a general solution (in case there are classes between the base and most derived class) but probably good enough as a first step. Or someone knows how to google this kind of Python design patterns and maybe something already exists I don't know.
  • tests
  • fix bugs highlighted by the tests

@martindurant
Copy link
Member

@lesteve , are you still working here? If you are done, I suggest you ping one of the maintainers for review.

@lesteve
Copy link
Member Author

lesteve commented May 18, 2020

I need to add some tests ... as the main active Dask-Jobqueue maintainer I guess I will have to ping myself once this is done 🙃

Out of curiosity do you have a particular interest in this PR ?

@martindurant
Copy link
Member

Yes, figure it out, sorry; trying to check in on several dask repos at once!

@martindurant
Copy link
Member

(but no, this is just one of a few PRs that seem to be green but marked WIP on the timescale of ~months)

@lesteve
Copy link
Member Author

lesteve commented May 18, 2020

No worries, if you have some suggestion on the approach while you are here, this would be more than welcome ...

This feels like one of this object-oriented thing where there must be a better way, but I just can not find it. Maybe I should have read more design patterns book when I was younger rather than studying Physics oh well ...

@martindurant
Copy link
Member

Yeah, passing around lots of kwargs is painful. I suppose you could have some config dictionary somewhere and work with that. I don't have much of a concrete improvement proposal for you though.

This line:

sig = inspect.signature(func)

is particularly painful, although I understand you don't want TypeErrors.

@lesteve
Copy link
Member Author

lesteve commented May 19, 2020

Thanks a lot for having a look!

is particularly painful, although I understand you don't want TypeErrors.

Yeah the problem is that the error happens in a class (job-level) that is quite far from the user (cluster-level) and I wanted a reasonably user-friendly error message. I agree it feels less than ideal ...

@martindurant
Copy link
Member

The alternative would perhaps be requiring the job classes to accept arbitrary kwargs and ignore/error on unexpected items in the class itself.

@lesteve
Copy link
Member Author

lesteve commented May 19, 2020

I think I thought about that at one point but it felt like the error message would not be as helpful for the user since the user creates a FooCluster and does not even know there is a FooJob involved which is more an internal implementation detail ...

I thought doing that and catching the job level exception at the cluster level for a better opportunity at a helpful error message but I was not really convinced either.

I'll try to have a second look at one point.

I think one of my main worry currently is that my inspect logic is quite fragile and may not work at all in some complicated class hierarchy ... at the same time it is not that likely to happen since the Job classes are internal implementation details so we control them in a way.

Thanks a lot for looking into anyway, I certainly appreciate the suggestions!

@lesteve lesteve changed the title [WIP] Fix name parameters being ignored in Cluster class and protocol + security not being passed through Fix name parameters being ignored in Cluster class and protocol + security not being passed through May 26, 2020
@lesteve
Copy link
Member Author

lesteve commented May 26, 2020

I am removing the WIP from this PR. The main thing I'd like to have some feedback on is the strategy on giving a reasonable error message to the user when he uses a parameter that does not exist in FooCluster. cc @guillaumeeb @willirath.

Here are the possible strategy I can see:

  1. don't do anything. The error message is not that great (the error happens in the FooJob class that the user does not know about) but at least the code is simple.
    Snippet:
python -c 'from dask_jobqueue import SLURMCluster; \
    cluster = SLURMCluster(\
        cores=1,memory="1GB",wrong_parameter="wrong_parameter_value")'

Error message:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/core.py", line 529, in __init__
    self._dummy_job  # trigger property to ensure that the job is valid
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/core.py", line 557, in _dummy_job
    return self.job_cls(
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/slurm.py", line 30, in __init__
    super().__init__(
TypeError: __init__() got an unexpected keyword argument 'wrong_parameter'
  1. use inspect to find out in advance what the allowed parameters are. Best error message (we can give a list of all allowed parameters) but code using inspect is fragile and may not work for complicated classes hierarchies.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/core.py", line 521, in __init__
    self.validate_job_kwargs()
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/core.py", line 645, in validate_job_kwargs
    raise ValueError(
ValueError: Wrong parameters: {'wrong_parameter': 'wrong_parameter_value'}.
Here are the list of allowed parameters: ['config_name', 'cores', 'death_timeout', 'env_extra', 'extra', 'header_skip', 'interface', 'job_cpu', 'job_extra', 'job_mem', 'job_name', 'local_directory', 'log_directory', 'memory', 'name', 'nanny', 'processes', 'project', 'protocol', 'python', 'queue', 'scheduler', 'security', 'shebang', 'walltime']
  1. catch TypeError and gives a better error message. Reasonable error message and not too fragile code. To me after some back and forth this seems the least worst of all the options but it's not like I was fully convinced ...
Traceback (most recent call last):
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/core.py", line 556, in _dummy_job
    return self.job_cls(
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/slurm.py", line 30, in __init__
    super().__init__(
TypeError: __init__() got an unexpected keyword argument 'wrong_parameter'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/core.py", line 527, in __init__
    self._dummy_job  # trigger property to ensure that the job is valid
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/core.py", line 569, in _dummy_job
    raise ValueError(
ValueError: Got unexpected keyword argument 'wrong_parameter'. Very likely this unexpected parameter was passed in "job_kwargs" in the SLURMCluster constructor:
job_kwargs={'cores': 1, 'memory': '1GB', 'wrong_parameter': 'wrong_parameter_value', 'config_name': 'slurm', 'interface': None, 'protocol': 'tcp://', 'security': None}

The switch between 2. and 3. was done in this commit so you can get a better feel of the code involved.

Any feed-back on the rest of this PR more than welcome!

@martindurant
Copy link
Member

I am certainly -1 on possibility 2, even if the target classes tend to be simple. Of the remaining two, I am ambivalent - a raw TypeError seems simple enough to me, and keeps the code simple, but I can see why users wouldn't want to dig to find out what arguments were in fact expected.

@lesteve lesteve changed the title Fix name parameters being ignored in Cluster class and protocol + security not being passed through Fix name parameters being ignored in Cluster class May 29, 2020
@lesteve lesteve merged commit f199cf9 into dask:master May 29, 2020
@lesteve lesteve deleted the simplify-job-kwargs branch May 29, 2020 16:06
matyasselmeci added a commit to matyasselmeci/dask-jobqueue that referenced this pull request Jun 16, 2020
* master:
  Tweak name in cluster.job_script(). (dask#439)
  Switch from Travis to GitHub Actions (dask#435)
  Fix name parameters being ignored in Cluster class (dask#398)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cluster classes accept any named parameters without error
2 participants