Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configuration Reference #6069

Merged
merged 19 commits into from Apr 24, 2020
Merged

Conversation

quasiben
Copy link
Member

@quasiben quasiben commented Apr 6, 2020

This PR adds a configuration page to the docs. As the configuration rather large, I'd like to finish up both scheduler and worker configs while getting consensus on style. In subsequent PRs I/we can continue adding configuration details.

Fixes #4286

I've been using this online RST editor

Attached is a screen shot of the current setup:
Screen Shot 2020-04-06 at 11 25 16 AM

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woo, thanks for working on this @quasiben! Looking forward to seeing this added. It appears the configuration-reference.rst file hasn't yet been included

@mrocklin
Copy link
Member

mrocklin commented Apr 6, 2020

I have two comments that I think we should consider:

Hierarchy

I'm curious about how best to handle hierarchy. Dask's configuration is nested, which I think helps with keeping things compartmentalized. The current structure has one level of nesting, followed up with a flattened hierarchy. This is a sensible choice, but there might be other choices.

One approach would be to use subsections and subsubsections to handle the nesting, and then use a Sphinx TOC on top. This would capture the hierarchical nature of our configuration and also maybe make it easy for users to navigate quickly to bits that they care about. It also might be way more confusing. I'm not sure.

Single source of truth

We're now copying configuration into two places. This might be necessary, but it's also something that we should be aware of and avoid if possible (although I have no suggestions on how best to avoid it).

@mrocklin
Copy link
Member

mrocklin commented Apr 6, 2020

One way to handle the single source of truth would be to have way more extensive comments in the config files themselves, and then autogenerate the docs. This has tradeoffs though.

@quasiben
Copy link
Member Author

quasiben commented Apr 6, 2020

@mrocklin, I briefly thought about subsections but was lazy -- as a test, I opted for the following:

When defining nested configurations, the top level default value will be blank, with subsequent keys and values listed below

The single source of truth is important but I am -0.5 on adding more comments to the config. With more text in the config it makes editing the YAML rather challenging and becomes error prone. It's also worth noting that dask has configuration management in two locations:

  • distributed.yaml
  • dask.yaml

@quasiben quasiben changed the title initial setup of configuration reference Configuration Reference Apr 6, 2020
@quasiben
Copy link
Member Author

quasiben commented Apr 6, 2020

I am looking into a smallish sphinx plugin + ruamel.yaml for loading comments

@mrocklin
Copy link
Member

mrocklin commented Apr 6, 2020

My hope is that using subsections wouldn't be terrible. It might also help to avoid the foo.bar.baz formatting in the table (which is liable to get long).

Let's take a look at the scheduler section

  scheduler:
    allowed-failures: 3     # number of retries before a task is considered bad
    bandwidth: 100000000    # 100 MB/s estimated worker-worker bandwidth
    blocked-handlers: []
    default-data-size: 1000
    # Number of seconds to wait until workers or clients are removed from the events log
    # after they have been removed from the scheduler
    events-cleanup-delay: 1h
    idle-timeout: null      # Shut down after this duration, like "1h" or "30 minutes"
    transition-log-length: 100000
    work-stealing: True     # workers should steal tasks from each other
    work-stealing-interval: 100ms  # Callback time for work stealing
    worker-ttl: null        # like '60s'. Time to live for workers.  They must heartbeat faster than this
    pickle: True            # Is the scheduler allowed to deserialize arbitrary bytestrings
    preload: []
    preload-argv: []
    unknown-task-duration: 500ms  # Default duration for all tasks with unknown durations ("15m", "2h")
    default-task-durations:  # How long we expect function names to run ("1h", "1s") (helps for long     tasks)
      rechunk-split: 1us
      shuffle-split: 1us
    validate: False         # Check scheduler state at every step for debugging
    dashboard:
      status:
        task-stream-length: 1000
      tasks:
        task-stream-length: 100000
      tls:
        ca-file: null
        key: null
        cert: null
    locks:
      lease-validation-interval: 10s  # The time to wait until an acquired semaphore is released if the  Client goes out of scope

So there are a lot of entries in the top level, which is great. The table will be simple for them. There are some larger sections, like scheduler.dashboard, which I think would make sense to break out to a subsection, and there are some smaller sections, like locks with a single element, that may not make sense to break out.

I think that it probably won't be hard to make subsections (the RST for this is fortunately simple). In some cases I think that we probably do want subsections, and in other cases the subsections might be small enough to even be a hindrance. I don't think that we want to mix approaches, but I'm not sure.

(I'm just thinking out loud here to get conversation going)

@mrocklin
Copy link
Member

mrocklin commented Apr 6, 2020

The single source of truth is important but I am -0.5 on adding more comments to the config. With more text in the config it makes editing the YAML rather challenging and becomes error prone.

Yeah, to be clear, I'm not pushing this. I'm just dumping a bunch of options out there to spur some thinking. This is going to be a lot of detailed work once we get something down. I'd prefer that we only do that detailed work once.

@quasiben
Copy link
Member Author

quasiben commented Apr 6, 2020

In #b06c46 I played with building an plugin to automate the table (just for fun). It's not awful :)

Below is a screen shot
Screen Shot 2020-04-06 at 5 53 22 PM

if comment_token is None:
comment = ""
else:
comment = comment_token[2].value
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically you could pull comments from another location besides the original yaml spec. However, this may come back around to single-source of truth issues

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In dask-gateway @jcrist uses traitlets to store all of this. I'm not sure that Dask should go that far, but it's a demonstration of this approach.

This could also be useful for input validation

@jcrist
Copy link
Member

jcrist commented Apr 6, 2020

Woot, this is nice to see @quasiben. I do agree with @mrocklin's concern about two sources of truth though.

For dask client-side configuration, lately I've been ensuring the default file is heavily (and nicely) commented, and then directly including that in the docs. Examples:

As another idea, the zero-to-jupyterhub project includes an extra validation script that tests that the values.yaml file matches the documented schema (see https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/master/jupyterhub/validate.py). The schema is stored in json, which makes it easier to use it to generate a nice html table for the docs.

*I personally prefer the traitlets approach, but recognize that not everyone might prefer that approach, and moving from Dask's existing configuration system would be tricky. *

@mrocklin
Copy link
Member

mrocklin commented Apr 6, 2020

*I personally prefer the traitlets approach, but recognize that not everyone might prefer that approach, and moving from Dask's existing configuration system would be tricky. *

Heh. That's me! :)

Yeah, I personally would be sad with traitlets I think (although I probably should do more homework on it). There is probably some happier medium though.

For dask client-side configuration, lately I've been ensuring the default file is heavily (and nicely) commented, and then directly including that in the docs.

Those are nice.

As another idea, the zero-to-jupyterhub project includes an extra validation script that tests that the values.yaml file matches the documented schema (see https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/master/jupyterhub/validate.py). The schema is stored in json, which makes it easier to use it to generate a nice html table for the docs.

A companion file like this might not be bad, especially if adding to it could be done optionally and users didn't see it directly.

@quasiben
Copy link
Member Author

quasiben commented Apr 7, 2020

A companion file like this might not be bad, especially if adding to it could be done optionally and users didn't see it directly.

A companion file would need to duplicate the keys, no ? Maybe that's ok and it's a baby step in the direction of validation ?

@jcrist
Copy link
Member

jcrist commented Apr 7, 2020

A companion file would need to duplicate the keys, no ? Maybe that's ok and it's a baby step in the direction of validation ?

To be clear, I think the value of the JupyterHub spec file approach is the validation script. So while the information is split between files (which could fall out of sync), there's a way to test that they remain in sync and that test could be run as part of our CI.

@mrocklin
Copy link
Member

Here is an attempt with jsonschema (the thing that JuptyerHub uses)

dask/distributed#3696

So far I'm decently happy with it.

@quasiben
Copy link
Member Author

I spent a bit of time with @mrocklin this weekend using PR dask/distributed#3696 to help with descriptions in the table. We pushed on getting the table to render everything and while it works, the rendering is a bit off:

Screen Shot 2020-04-12 at 11 37 27 PM

I also spent some time adding a custom yaml to HTML styling. This non-csv layout of the config perhaps balances verbose descriptions (which users will appreciate) without sacrifice layout/readability (which the csv-table does). However, we lose fancy TOC linking:

Screen Shot 2020-04-12 at 11 32 13 PM

@mrocklin
Copy link
Member

mrocklin commented Apr 13, 2020 via email

@jcrist
Copy link
Member

jcrist commented Apr 13, 2020

This non-csv layout of the config perhaps balances verbose descriptions (which users will appreciate) without sacrifice layout/readability (which the csv-table does). However, we lose fancy TOC linking.

I personally prefer this style, and think you should be able to do TOC linking. Refs work fine for the setup I have for dask-gateway (https://gateway.dask.org/api-server.html), I feel like they should be doable here too.

@quasiben
Copy link
Member Author

@jcrist this is ready for a review if you have some time.

Note: i also included a dask-schema in the PR. This made iterating a bit faster for me but also happy to move to another PR if you'd like.

- Register the extension module appropriately.
- Lint conf.py and yamltohtml.py

.. dask_config_to_html::
:location: distributed.dashboard
:config: https://raw.githubusercontent.com/dask/distributed/70700a1059fdae542ddbb3534f3caa3d27ca2e5d/distributed/distributed.yaml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you intend to change this when dask/distributed#3696 is merged, just making a note so it's not forgotten.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup yup, still futzing around with a few things

Upper limit for width, where width = num_nodes / height, a good measure
of parallelizability

subraphs:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct but I suspect this should be changed to subgraphs in a later PR

@quasiben
Copy link
Member Author

This is ready for another review should @jcrist or @mrocklin have some time

Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks pretty good to me. Thanks @quasiben.

docs/source/conf.py Outdated Show resolved Hide resolved
dask/dask-schema.yaml Outdated Show resolved Hide resolved
@TomAugspurger
Copy link
Member

I think this is safe to merge when you're when you're ready. Maybe give @mrocklin a bit of time to object if he wants to take another look.

size:
type: integer
description: |
The size of pixels used when displaying a dask array as an SVG image.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The size of pixels used when displaying a dask array as an SVG image.
The size of the image in pixels used when displaying a dask array as an SVG image.

dask/dask-schema.yaml Outdated Show resolved Hide resolved
quasiben and others added 2 commits April 22, 2020 11:19
Co-Authored-By: Matthew Rocklin <mrocklin@gmail.com>
Co-Authored-By: Matthew Rocklin <mrocklin@gmail.com>
@mrocklin
Copy link
Member

mrocklin commented Apr 22, 2020 via email

@jcrist
Copy link
Member

jcrist commented Apr 22, 2020

I think that it would be useful to have a TOC at the top of the
configuration-reference page so that people can see quickly the options
that are available to them without scrolling linearly through everything

Good idea. That should be doable by inserting:

.. contents:: :local:

where you want it.

@quasiben
Copy link
Member Author

You won't get dask config values because the schema file is currently not in master. After merging you should be fine.

We could add a TOC or they could use the TOC on left bar:

Screen Shot 2020-04-22 at 11 22 56 AM

@quasiben
Copy link
Member Author

Thanks for the suggestion @jcrist and for all the reviews !

Screen Shot 2020-04-22 at 11 27 26 AM

@quasiben
Copy link
Member Author

Tests are currently failing because dask-schema is not in master. I can break that out into another PR then merge this or merge this and things should pass. Should they not I can immediately fix

Copy link
Member

@jsignell jsignell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@quasiben - Feel free to merge this PR as-is when you have time to make sure that everything goes according to plan.

@quasiben
Copy link
Member Author

Thanks @jsignell . I'll keep watch here and the docs

@quasiben quasiben merged commit 4c7b170 into dask:master Apr 24, 2020
@quasiben
Copy link
Member Author

docs are broken -- fixing now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Documentation Request] Configuration Reference
6 participants