Configuration File #463

Closed

mrocklin opened this issue Aug 26, 2016 · 7 comments

Comments

@mrocklin
Member

mrocklin commented Aug 26, 2016

Continuation of #58

I think it's now time to have a configuration file. There are a few options that would be nicer to manage on a per-machine basis than through various command-line options (though those will remain dominant) or hard-coded settings.

Here are a few:

  1. Logging levels for dask
  2. Logging levels for the bokeh web application
  3. Compression
  4. Ports for the scheduler, JSON, web interface, etc.
  5. Whitelisted ports for bokeh (though this is now open by default)
  6. Whether or not to use PDB when an error occurs (I use this for debugging)

Some open questions:

  1. Where do we put this file? I'm thinking ~/.dask/config
  2. What format do we use, JSON, YAML, TOML, INI?
  3. Are there other options that people find themselves often setting that we would want to include? We could also just include all options available through the CLI.
  4. Desired nesting level? For example
'scheduler': {'port': 8786,
              'bokeh': 8787}, 
...

vs

'scheduler-port': 8786,
'scheduler-bokeh': 8787,
...

@quasiben I would value your feedback in particular here.

I don't have much scar tissue on this topic.

@mrocklin
Member Author

Currently going ahead with YAML. So far I'm only putting in options that I use personally; I plan to add more as people need them:

logging:
  distributed: info
  distributed.executor: warning
  bokeh: critical

compression: auto

# Scheduler specific options

bandwidth: 100000000    # 100 MB/s estimated worker-worker bandwidth
allowed-failures: 3     # number of retries before a task is considered bad
pdb-on-err: False       # enter debug mode on scheduling error
transition-log-length: 100000
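
For illustration, a minimal sketch of how such a file might be read, assuming PyYAML and the ~/.dask/config path proposed above (not necessarily what #472 implements):

import os
import yaml

# Hypothetical loader: start from hard-coded defaults, then overlay
# whatever ~/.dask/config provides.
defaults = {'compression': 'auto', 'allowed-failures': 3}

config = dict(defaults)
path = os.path.expanduser('~/.dask/config')
if os.path.exists(path):
    with open(path) as f:
        config.update(yaml.safe_load(f) or {})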

@quasiben
Member

All formats are bad

With that said, I've found YAML for config files to be not as bad as the others. YAML is:

  • human editable/readable
  • supports comments

YAML also gets a bit better with some extra tooling.

@mrocklin
Member Author

Implemented in #472

@minrk
Contributor

minrk commented Sep 1, 2016

Personally, I like having nesting for grouping related config:

scheduler:
   port: 123

I don't know what the scope of your configurability would be, though.

Since all the CI services use YAML, developers are getting used to it, so it makes sense to me.

@mrocklin
Member Author

mrocklin commented Sep 2, 2016

Fixed by #472

@mrocklin mrocklin closed this as completed Sep 2, 2016
@kszucs
Contributor

kszucs commented Sep 9, 2016

Personally I favor environment variables over any configuration file. In our distributed setup (Docker containers on top of Mesos, Marathon, Chronos) the common practice is also env variables; distributing files is much more problematic (it needs shared storage like HDFS/S3). Click also has built-in support for reading options from the environment.
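
As an illustration of that Click feature, a small sketch (the option and variable names here are hypothetical, not actual dask CLI flags):

import click

@click.command()
@click.option('--scheduler-port', envvar='DASK_SCHEDULER_PORT', default=8786,
              help='Falls back to the DASK_SCHEDULER_PORT environment variable.')
def main(scheduler_port):
    click.echo(scheduler_port)

if __name__ == '__main__':
    main()   # e.g. DASK_SCHEDULER_PORT=9000 python cli.py  ->  prints 9000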

In our workflow manager, a Click CLI script submits the computation as a Chronos or Marathon (meta-schedulers on top of Mesos) task, which starts a Mesos (dask.mesos) framework, which schedules multiple tasks across the cluster. Each of these tasks can in turn start, for example, a local dask computation, a distributed Spark job, another Mesos framework, a data-migration tool, etc. The workflow manager needs to forward/ship the configuration down to the leaves (for example a Cassandra host:port).

Personally I use dask.context._globals for this purpose. IMHO that would be a better container in which to store and ship config values (read from the CLI and environment variables), especially because I can temporarily override them with set_options.
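
A minimal sketch of that temporary-override pattern, assuming the dask.set_options API of the time (the option name cassandra_host is made up for illustration):

import dask
from dask.context import _globals

# set_options stores arbitrary keyword arguments in _globals and, used as a
# context manager, restores the previous values on exit.
with dask.set_options(cassandra_host='10.0.0.5'):
    assert _globals['cassandra_host'] == '10.0.0.5'
# outside the block the previous value (or its absence) is restored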

Auto-shipping can be solved via a custom pickler:

from cloudpickle import CloudPickler
from dask.context import _globals, set_options

def inject_addons(self):
    # ship the current _globals configuration with every pickled payload;
    # on unpickling, set_options(**opts) re-applies it on the receiving side
    self.save_reduce(lambda opts: set_options(**opts), (_globals,))

# register reducer to auto-pickle the _globals configuration
CloudPickler.inject_addons = inject_addons

@mrocklin
Member Author

mrocklin commented Jan 14, 2019 via email
