Configuration File #463

Closed

mrocklin opened this issue Aug 26, 2016 · 7 comments

Comments

@mrocklin
Member

mrocklin commented Aug 26, 2016

Continuation of #58

I think it's now time to have a configuration file. There are a few options that would be nicer to manage on a per-machine basis than through various command-line options (though those will remain dominant) or hard-coded settings.

Here are a few:

  1. Logging levels for dask
  2. Logging levels for the bokeh web application
  3. Compression
  4. Ports for the scheduler, JSON, web interface, etc.
  5. Whitelisted ports for bokeh (though this is now open by default)
  6. Whether or not to use PDB when an error occurs (I use this for debugging)

Some open questions:

  1. Where do we put this file? I'm thinking ~/.dask/config
  2. What format do we use, JSON, YAML, TOML, INI?
  3. Are there other options that people find themselves often setting that we would want to include? We could also just include all options available through the CLI.
  4. Desired nesting level? For example
'scheduler': {'port': 8786,
              'bokeh': 8787}, 
...

vs

'scheduler-port': 8786,
'scheduler-bokeh': 8787,
...

@quasiben I would value your feedback in particular here.

I don't have much scar tissue on this topic.

@mrocklin
Member Author

Currently going ahead with YAML. So far I'm only putting in options that I use personally; I plan to add more as people need them:

logging:
  distributed: info
  distributed.executor: warning
  bokeh: critical

compression: auto

# Scheduler specific options

bandwidth: 100000000    # 100 MB/s estimated worker-worker bandwidth
allowed-failures: 3     # number of retries before a task is considered bad
pdb-on-err: False       # enter debug mode on scheduling error
transition-log-length: 100000
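
For illustration, a minimal sketch of how such a file might be read, assuming PyYAML and the ~/.dask/config path proposed above (not necessarily what #472 implements):

import os
import yaml

# Hypothetical loader: start from hard-coded defaults, then overlay
# whatever ~/.dask/config provides.
defaults = {'compression': 'auto', 'allowed-failures': 3}

config = dict(defaults)
path = os.path.expanduser('~/.dask/config')
if os.path.exists(path):
    with open(path) as f:
        config.update(yaml.safe_load(f) or {})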

@quasiben
Member

All formats are bad

With that said, I've found YAML for config files to be not as bad as the others. YAML is:

  • human editable/readable
  • supports comments

YAML also gets a bit better with some extra tooling.

@mrocklin
Member Author

Implemented in #472

@minrk
Contributor

minrk commented Sep 1, 2016

Personally, I like having nesting for grouping related config:

scheduler:
   port: 123

I don't know what the scope of your configurability would be, though.

Since all the CI services use YAML, developers are getting used to it, so it makes sense to me.

@mrocklin
Member Author

mrocklin commented Sep 2, 2016

Fixed by #472

@mrocklin mrocklin closed this as completed Sep 2, 2016
@kszucs
Contributor

kszucs commented Sep 9, 2016

Personally I favor environment variables over any configuration file. In our distributed setup (Docker containers on top of Mesos, Marathon, Chronos) the common practice is also env variables; distributing files is much more problematic (it needs shared storage like HDFS/S3). Click also has built-in support for reading options from the environment.
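
As an illustration of that Click feature, a small sketch (the option and variable names here are hypothetical, not actual dask CLI flags):

import click

@click.command()
@click.option('--scheduler-port', envvar='DASK_SCHEDULER_PORT', default=8786,
              help='Falls back to the DASK_SCHEDULER_PORT environment variable.')
def main(scheduler_port):
    click.echo(scheduler_port)

if __name__ == '__main__':
    main()   # e.g. DASK_SCHEDULER_PORT=9000 python cli.py  ->  prints 9000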

In our workflow manager, a Click CLI script submits the computation as a Chronos or Marathon (meta-schedulers on top of Mesos) task, which starts a Mesos (dask.mesos) framework, which schedules multiple tasks across the cluster. Each of these tasks can in turn start, for example, a local dask computation, a distributed Spark job, another Mesos framework, a data-migration tool, etc. The workflow manager needs to forward/ship the configuration down to the leaves (for example a Cassandra host:port).

Personally I use dask.context._globals for this purpose. IMHO that would be a better container in which to store and ship config values (read from the CLI and environment variables), especially because I can temporarily override them with set_options.
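
A minimal sketch of that temporary-override pattern, assuming the dask.set_options API of the time (the option name cassandra_host is made up for illustration):

import dask
from dask.context import _globals

# set_options stores arbitrary keyword arguments in _globals and, used as a
# context manager, restores the previous values on exit.
with dask.set_options(cassandra_host='10.0.0.5'):
    assert _globals['cassandra_host'] == '10.0.0.5'
# outside the block the previous value (or its absence) is restored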

Auto-shipping can be solved via a custom pickler:

from cloudpickle import CloudPickler
from dask.context import _globals, set_options

def inject_addons(self):
    # ship the current _globals configuration with every pickled payload;
    # on unpickling, set_options(**opts) re-applies it on the receiving side
    self.save_reduce(lambda opts: set_options(**opts), (_globals,))

# register reducer to auto-pickle the _globals configuration
CloudPickler.inject_addons = inject_addons

@mrocklin
Member Author

mrocklin commented Jan 14, 2019 via email
