Skip to content

Overhaul fuse() config#6198

Merged
jcrist merged 3 commits intodask:masterfrom
crusaderky:config
May 16, 2020
Merged

Overhaul fuse() config#6198
jcrist merged 3 commits intodask:masterfrom
crusaderky:config

Conversation

@crusaderky
Copy link
Copy Markdown
Collaborator

Overhaul the config file and default values for fuse().

  • Now all defaults are properly and uniquely defined in dask.yaml. Removed all hardcoded defaults in the code.
  • The schema now fully covers all config settings
  • Overhauled documentation
  • Fixed bug where a user with subgraphs: true would explicitly set fuse_subgraphs=False when invoking fuse() and have it ignored
  • It is now possible to have a static value in the config for max-width and max-depth-new-edges and explicitly pass None when invoking fuse to make them revert to the dynamic calculation

Caveats: none. I was extremely conservative in the change so there shouldn't be anything at risk of breaking. This came at the cost of a rather unsightly and inconsistent design in the optimization functions for array and bag - where I strictly retained the previous behaviour.

From looking at the code I see that many other config settings could use the same kind of attention - but they would substantially increase the scope of the change, so I'd rather leave them to a later PR.

@crusaderky
Copy link
Copy Markdown
Collaborator Author

This is ready for review and merge

@crusaderky
Copy link
Copy Markdown
Collaborator Author

@mrocklin @jcrist could you review or nominate somebody for review please?

@lr4d
Copy link
Copy Markdown
Contributor

lr4d commented May 15, 2020

This came at the cost of a rather unsightly and inconsistent design in the optimization functions for array and bag - where I strictly retained the previous behaviour

Would it make sense to define fuse configuration parameters tied to a specific collection to make configuring this more straightforward across the board?

@crusaderky
Copy link
Copy Markdown
Collaborator Author

@lr4d yes that's definitely a clean option, although it would break backwards compatibility

@mrocklin
Copy link
Copy Markdown
Member

mrocklin commented May 15, 2020 via email

Copy link
Copy Markdown
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @crusaderky, overall this looks like a nice change. Just one small nit then ready for merge.

return (_enforce_max_key_limit(concatenated_name),) + first_key[1:]


_use_config = object()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other places we use "__no_default__" for this. Since tooling like sphinx, ipython, etc... will repr the defaults in the signature, having something more human-readable can be nice.

Alternatively, in skein I use something like:

# overriding `__reduce__` lets the option be properly picklable as a singleton as well.
default = type('default', (object,),
                dict.fromkeys(['__repr__', '__reduce__'],
                              lambda s: 'default'))()

for the same purpose.

Either is fine, but I'd prefer something other than a raw object().

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, see if you like it now

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, the issue wasn't around typing or any of that, but more how it displays to the user in ipython/sphinx/etc.... Using an enum for this is fine, provided it has a descriptive repr (UseConfig.token does not).

We don't currently use python type annotations in dask, adding them is a separate issue and shouldn't be done here (IMO).

Perhaps

In [12]: class Default(enum.Enum):
    ...:     value = ""
    ...:     def __repr__(self):
    ...:         return "<default>"
    ...:
    ...:

In [13]: def foo(x=Default.value):
    ...:     pass
    ...:

In [14]: foo?
Signature: foo(x=<default>)
Docstring: <no docstring>
File:      ~/Code/dask/<ipython-input-13-ed0fa1715aa4>
Type:      function

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified as requested.

Sphinx HTML rendered output:

dask.optimization.fuse(dsk, keys=None, dependencies=None, ave_width=<default>,
max_width=<default>, max_height=<default>, max_depth_new_edges=<default>,
rename_keys=<default>, fuse_subgraphs=<default>)

@lr4d
Copy link
Copy Markdown
Contributor

lr4d commented May 15, 2020

@lr4d yes that's definitely a clean option, although it would break backwards compatibility

Couldn't we introduce some config parameters in the form of dask.bag.fuse.rename_keys where we set the current default values for these? e.g. for dask.bag as defined in dask.bag.core.optimize
I'm not saying this should be done in this PR, just wondering if it makes sense.

@crusaderky
Copy link
Copy Markdown
Collaborator Author

@lr4d What you say makes sense and it's a clean design if backwards compatibility is not an issue.
The problems happen when an existing user was controlling the behaviour of bag through optimization.fuse.rename_keys - which would make sense for a user that uses bag exclusively. Such a user will see its config suddendly ignored and overriden by the default dask.bag.fuse.rename_keys, and dask won't have any way of even warning him that his setting is now being ignored.

@crusaderky crusaderky closed this May 15, 2020
@crusaderky crusaderky reopened this May 15, 2020
@jcrist jcrist merged commit 1a75a44 into dask:master May 16, 2020
@jcrist
Copy link
Copy Markdown
Member

jcrist commented May 16, 2020

Thanks!

@jorisvandenbossche
Copy link
Copy Markdown
Member

On master, Travis reports some failed builds with "KeyError: 'optimization'" in config.py (eg https://travis-ci.org/github/dask/dask/jobs/688159707, https://travis-ci.org/github/dask/dask/jobs/688159706).

And the reason I noticed is that I see the same error in our nightly tests of dask master with pyarrow master in Apache Arrow.

@jcrist
Copy link
Copy Markdown
Member

jcrist commented May 18, 2020

This is because dask doesn't depend on yaml, calls to dask.config.get need to include a default in the code to support cases where the default dask.yaml file isn't read. I knew this, but forgot to check it for this PR, which shows how easy it is to miss this. We also only test minimal installs on a cron job.

I've opened #6221 to discuss making yaml a required dependency, since that seems more tenable than forcing duplication of defaults between the config file and the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants