Switch `_` -> `-` config normalization to `-` -> `_` #4422

jcrist · 2019-01-25T19:33:50Z

Previously we'd normalize all key names in dask.config to hyphenated
names, replacing all underscores with hyphens. This causes problems with
keys that should be taken as literal values (e.g. environment variables
in an env key). We really should only normalize key names that are
explicitly declared (some kind of schema), but for now we just swap the
normalized form since underscores are more common for environment
variables.

Tests added / passed
Passes flake8 dask

Fixes #4366.
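
A minimal sketch of the problem described above. `normalize_all` is a hypothetical stand-in for blanket key normalization, not the actual `dask.config` code; it only shows how rewriting every dict key also rewrites keys that belong to a user-supplied value such as an `env` mapping.

```python
def normalize_all(obj):
    """Recursively replace underscores with hyphens in every dict key (hypothetical helper)."""
    if isinstance(obj, dict):
        return {k.replace("_", "-"): normalize_all(v) for k, v in obj.items()}
    return obj


config = {
    "num_workers": 4,                                   # a real config key
    "yarn": {"worker": {"env": {"FOO_BAR": "baz"}}},    # FOO_BAR is a literal env var name
}

normalized = normalize_all(config)

# The config key is normalized as intended ...
print(normalized["num-workers"])
# ... but the environment variable name inside the value is rewritten as well.
print(normalized["yarn"]["worker"]["env"])
```

Swapping the direction of the replacement only moves the problem to values whose keys contain hyphens.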


jcrist · 2019-01-25T20:12:24Z

Looks like this causes distributed to fail to import, since distributed explicitly manipulates the dask.config.config dict directly. I could patch distributed, but then dask master would require distributed master, which is a bit restrictive.

It may be better to just avoid normalization for non-declared-keys. Thoughts @mrocklin, @djhoese?

djhoese · 2019-01-25T20:14:04Z

For the record, the way distributed is using the config (accessing the internal dict directly) is undesired in the long run, but I haven't thought of a good way to transition it. Let me look at the test failures now...

djhoese · 2019-01-25T20:14:52Z

Ok so normalizing all keys just isn't going to work. Is that what we're seeing?

jcrist · 2019-01-25T20:17:07Z

Not unless we accept requiring distributed master to use dask master, as it would require patching distributed as well.

djhoese · 2019-01-25T20:20:54Z

Well distributed could be updated to use the config "module" (eventually will be an object) and that should fix this distributed error, right? So distributed could be released working for new and old versions of dask core (as much as it does now). Then dask core could be released with a dependence on the newest version of distributed. Right?

Not great, but not completely ugly.

jcrist · 2019-01-25T20:29:34Z

Yes, distributed could be made compatible with old and new versions of dask, but dask won't be compatible with old versions of distributed.

djhoese · 2019-01-25T20:33:18Z

Or... can key normalization be turned off except for specific keys, as a sort of deprecation cycle? So basically no config keys with hyphens in them (as discussed by me and @mrocklin in #4143, which was a response to #4141) so they can be set by environment variables.

jcrist · 2019-01-25T20:36:23Z

That seems like a lot of work to maintain backwards compatibility. I may give the only-normalize-declared-keys approach a try.

mrocklin · 2019-01-25T21:06:12Z

We probably shouldn't manipulate the config dict directly. A fix to that library would be welcome.

> but then dask master would require distributed master, which is a bit restrictive

We can also just release distributed

djhoese · 2019-01-25T21:50:25Z

@jcrist I think it would be about the same amount of work as normalizing only declared keys. The main advantage is that dask could then standardize key names as being underscore-separated only. That way environment variables and python keyword arguments could both be used for setting configuration options. Right now there are a few hyphenated keys (as mentioned in #4143) that are historical, but besides that there is no reason they need to be hyphenated or should be. At least that is my understanding.

In the spirit of The Zen of Python, there should be one and only one way to do things. Allowing multiple forms of config keys sounds like asking for trouble (in my experience). As we've seen, it has caused some issues with the way it was implemented. However, I'm not a dask maintainer, so feel free to disagree with me and point out what I'm missing.

jcrist · 2019-01-25T22:15:40Z

> The main advantage is that dask could then standardize key names as being underscore-separated only.
> ...
> Allowing multiple forms of config keys sounds like asking for trouble (in my experience).

I agree that multiple ways to spell the same key here is unfortunate and confusing. If we were to standardize on underscores only we'd still need a long deprecation cycle for any user code that sets them using hyphens. I see this as separate from the main issue here, which is that key normalization can accidentally affect config values rather than config keys. If we switch to underscores then we still have the same issue as before, but in reverse (users can't explicitly set values that contain dicts with hyphens). While hyphens may be less common, we really should never be modifying the values a user assigned to a configuration key.

Only applying normalization to known keys keeps the flexibility in spelling, and prevents the normalization from leaking into any nested datastructures the user might set as their value. Explicit keys to normalize, rather than implicitly normalizing all of them.
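
A rough sketch of what "only normalize declared keys" could look like. Everything here is an assumption for illustration: `normalize_declared` and the `schema` layout are hypothetical, and an empty dict in the schema is used to mean "free-form user value, do not descend".

```python
def canonical(key):
    """Hyphenated spelling of a key."""
    return key.replace("_", "-")


def normalize_declared(config, schema):
    """Normalize only keys that `schema` declares; leave all other keys untouched."""
    declared = {canonical(k): v for k, v in schema.items()}
    out = {}
    for key, value in config.items():
        name = canonical(key)
        if name in declared:
            sub = declared[name]
            # Recurse only where the schema itself declares nested keys.
            if isinstance(sub, dict) and isinstance(value, dict):
                value = normalize_declared(value, sub)
            out[name] = value
        else:
            # Undeclared key: treat it as user data and keep its spelling.
            out[key] = value
    return out


schema = {"num-workers": None, "yarn": {"worker": {"env": {}}}}
config = {"num_workers": 4, "yarn": {"worker": {"env": {"FOO_BAR": "baz"}}}}

result = normalize_declared(config, schema)
print(result)
```

Here `num_workers` is normalized because it is declared, while `FOO_BAR` survives because nothing below `env` is declared.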

djhoese · 2019-01-25T22:29:26Z

To be clear, when you say "config values", you mean that someone is setting a dictionary as the value for a configuration key. Right now, dask doesn't have a way to know whether that dictionary is a nested configuration dictionary or a user dictionary value so it normalizes them, messing up peoples values. Right?

I was suggesting two separate solutions.

1. One is that we switch to underscores as the internal representation for config keys. As you said this gives us the same problem but is less likely to mess things up. Still not preferred.
2. Only normalize a specific set of keys (the historical ones with hyphens in them) and "standardize" any future dask keys (any new ones added by dask-core or dask subpackages) as requiring underscore separation and no hyphens. This underscore-only naming would be the convention, not a normalization, and wouldn't be handled specially in the config code.

Yes, this would be a long deprecation cycle but it (normalizing legacy key names) never has to change or be updated until it is removed (fully deprecated). This brings dask towards a final end goal of underscore-only config keys by convention.

All of that said, your solution of only normalizing specific keys does have the nice feature that you could stop analyzing sub-dictionaries because you know what is a value and what is a key, right? Hm, except once the legacy keys are fully deprecated in my suggestion then they don't have to be recursed at all.

jcrist · 2019-01-25T22:34:24Z

> Only normalize a specific set of keys (the historical ones with hyphens in them) and "standardize" any future dask keys (any new ones added by dask-core or dask subpackages) as requiring underscore separation and no hyphens.

I think this is maybe the best solution, but don't know how the other dask-maintainers (e.g. @mrocklin) would feel about this. My proposal above avoids making this decision at the expense of more complicated code in dask.config.

djhoese · 2019-01-25T23:22:12Z

What I was saying before is that I'm pretty sure your proposal will work almost the same as mine, just with a different set of keys that should be normalized. It would probably require searching for all hyphenated keys though.

jcrist · 2019-01-25T23:26:00Z

@mrocklin, would you be fine deprecating (with a substantial deprecation cycle) all the hyphenated names in favor of underscores only?

mrocklin · 2019-01-25T23:38:35Z

I have an aesthetic preference for hyphens over underscores in yaml config (and generally). I'll admit that I haven't followed the conversation in this issue entirely, is that what you're suggesting?

jcrist · 2019-01-25T23:46:09Z

The options proposed are:

1. Standardize on underscore names everywhere (with a deprecation cycle), and remove the normalization code. Underscore names work as environment variables, python kwargs, and yaml keys, so there's no need to normalize, and removing the normalization process prevents accidentally changing keys that really are keys in a dict value (like yarn.worker.env).
2. Only normalize a set of "registered" keys that dask cares about. This continues to allow underscores and hyphens everywhere, and avoids accidentally applying the normalization code to dict values (like yarn.worker.env). This would require all modules to register what keys they're looking for, but that could be done by automatically looking at the default files they distribute.

I agree with @djhoese that option 1 is probably the simplest and cleanest. Option 2 has more "aesthetic" config names, but would result in more complicated (and potentially fragile) code.

mrocklin · 2019-01-26T01:07:09Z

Normalization only seemed to be necessary when using environment variables.

I wonder if we might remove normalization generally, but continue to handle it for the environment variable case by checking for the presence of a hyphenated name. In the DASK_NUM_WORKERS case (the original reason for normalization in #4141) we would check to see if either num_workers or num-workers was in the config and, if so, use it. Otherwise we would default to underscores. Other than that no normalization would take place.

Searching for the -/_ difference might get a little complex with nested config values DASK_FOO__BAR__BAZ_QUUX, but might not be too bad. I think that this would probably reduce magic overall.

jcrist · 2019-01-26T01:31:18Z

> I wonder if we might remove normalization generally, but continue to handle it for the environment variable case by checking for the presence of a hyphenated name.

Are you saying when setting an environment variable, check for a hyphenated version first to set it as, and fall back to an underscored version? So DASK_FOO_BAR would be stored as foo-bar if that key existed already, otherwise foo_bar?

If so, that won't work in the following scenario:

- User sets DASK_FOO__BAR_BAZ=1 in the environment
- User imports dask, configuration is loaded from config files and dask.yaml default
- Config sees foo.bar-baz doesn't exist, stores it as foo.bar_baz
- User then imports dask_foo package for the first time. Since it hadn't been imported yet, foo.yaml didn't exist, so none of the foo config values were loaded initially
- Upon import, the dask_foo package sets foo.bar-baz as the default, since the environment variable was stored as foo.bar_baz above.
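
The scenario above can be reproduced in miniature. This is only a sketch: `set_from_env` is a hypothetical helper that mimics "use the hyphenated spelling if it already exists, otherwise keep underscores", and `foo_section` stands in for the `foo` part of the config.

```python
def set_from_env(section, name, value):
    """Store an env-derived key under the hyphenated spelling only if that spelling already exists."""
    hyphenated = name.replace("_", "-")
    key = hyphenated if hyphenated in section else name
    section[key] = value


# dask has been imported, but dask_foo (and its foo.yaml defaults) has not.
foo_section = {}

# DASK_FOO__BAR_BAZ=1 is read; "bar-baz" is not known yet, so "bar_baz" is stored.
set_from_env(foo_section, "bar_baz", 1)

# Later, dask_foo is imported and installs its default under the hyphenated name.
foo_section.setdefault("bar-baz", 0)

# Both spellings now coexist, and code reading "bar-baz" sees the default
# rather than the value from the environment variable.
print(foo_section)
```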

I really think we need to either:

- Always normalize keys and ensure we only normalize keys that are actually keys (through explicit registration)
- Never normalize keys and develop a consistent naming pattern that works for both (e.g. dask config names are always underscores or hyphens)

We could stick with hyphens if we enforced hyphenated names everywhere, and consistently normalized environment variables as envvar.lower().replace('__', '.').replace('_', '-')
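
As a small sketch of that conversion: the chained `lower().replace(...)` expression is the one given above, while the `env_to_key` name and the explicit stripping of the `DASK_` prefix are assumptions added for illustration.

```python
def env_to_key(envvar, prefix="DASK_"):
    """Turn an environment variable name into a dotted, hyphenated config key (hypothetical helper)."""
    if not envvar.startswith(prefix):
        raise ValueError(f"{envvar!r} does not start with {prefix!r}")
    name = envvar[len(prefix):]
    # Double underscores mark nesting; remaining single underscores become hyphens.
    return name.lower().replace("__", ".").replace("_", "-")


print(env_to_key("DASK_NUM_WORKERS"))
print(env_to_key("DASK_FOO__BAR_BAZ"))
```

The order of the two `replace` calls matters: the double underscores have to be consumed before single underscores are turned into hyphens.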

mrocklin · 2019-01-26T01:52:40Z

Good point about the first import time case. Not surprising I suppose that magic gets in the way of magic. Still thinking


jcrist · 2019-01-26T16:22:26Z

> We could stick with hyphens if we enforced hyphenated names everywhere, and consistently normalized environment variables as envvar.lower().replace('__', '.').replace('_', '-')

Thoughts on this ^^? No magic, still get hyphenated names, we just enforce them now.

mrocklin · 2019-01-26T22:35:55Z

What about when someone does something like dask.config.set(num_workers=10) ? Do we auto-normalize to num-workers?

jcrist · 2019-01-27T00:01:22Z

I forgot we also supported that syntax (although it doesn't look like we handle nested attributes that way, e.g. yarn__worker__env={'foo_bar': 'baz'}). I'd say yes to that as well.

Really we just want to avoid any normalization for the following input forms:

keys in yaml files, as keys and dict values look the same

yarn:
  worker:
    env:
      foo_bar: baz

keys in dicts, as keys and dict values look the same

dask.config.set({'yarn': {'worker': {'env': {'foo_bar': 'baz'}}}})

One argument for underscores instead of hyphens is that we'd only need to replace __ with . for the envvar & keyword inputs; everything else would work as is.

jcrist · 2019-02-19T18:34:01Z

Any more thoughts here @mrocklin?

mrocklin · 2019-02-19T22:35:25Z

@jcrist sorry for the long delay here. Looking back on this issue with a fresh mind I find that I'm still not totally sure what is being proposed here.

It sounds like one suggestion is the following:

> Standardize on underscore names everywhere (with a deprecation cycle), and remove the normalization code. Underscore names work as environment variables, python kwargs, and yaml keys, so there's no need to normalize, and removing the normalization process prevents accidentally changing keys that really are keys in a dict value (like yarn.worker.env).

That does seem consistent though I'll admit that I'll be sad to lose hyphens (which I find easier to type and somewhat more modern, though these are both subjective) and also I dread somewhat the config mismatches we'll get into. This would have to be a long deprecation cycle due to the presence of files on hard drives.

Folks have been talking about a set of "registered" names. Just as an FYI, any imported module will have its default configuration placed in dask.config.defaults, which may serve as such a registry. This only happens after import though, so isn't foolproof. I only bring this up as potential ammunition, I have no particular thought on how this can be used.

mrocklin · 2019-02-19T22:36:25Z

What happens if we just turn normalization off and leave everything else as-is? (This may already have been proposed, sorry)

Presumably the situation in #4141 returns. Maybe that's ok and we just say "use files, not environment variables"?

djhoese · 2019-02-20T01:30:40Z

> I'll be sad to lose hyphens (which I find easier to type and somewhat more modern, though these are both subjective)

Yeah, I have the opposite opinion, especially for filenames. But it's an opinion nonetheless.

> Presumably the situation in #4141 returns. Maybe that's ok and we just say "use files, not environment variables"?

As someone who has users using environment variables a lot, I'd be sad to have this happen. This one especially (num_workers) is a pretty common one for people to want to change. Environment variables give easy access to "quick" changes and are used pretty often by some groups I've worked with, especially when bash scripts are common. If this is done I don't see why there couldn't also be a deprecation cycle for the existing keys...to an underscore standard. Darn, that was the main suggestion.

It would be really nice to not have to clarify the config usage with "any config value can be specified in python, by environment variable, or in a YAML config except for in case X, Y, Z where ... is not possible".

mrocklin · 2019-04-08T19:52:04Z

@jcrist and I sat down and spoke about this last week. We came to an agreement where we don't normalize any values that go into the config, but we do teach get/set to treat hyphenated and underscored spellings equivalently:

dask.config.set({'x-y': {'a_b': 123}})

config = {
    'x-y': {'a_b': 123}
}

>>> dask.config.get('x_y')
{'a_b': 123}


dask.config.set(x_y={'a_b': 456})

config = {
    'x-y': {'a_b': 456}
}

jcrist · 2019-04-08T19:54:19Z

To clarify this further: if an equivalent value already exists in the mapping, we use that spelling (that's why x_y follows x-y above, since they're equivalent). If a value doesn't already exist, we use the spelling exactly as provided by the user. This prevents unnecessary normalization, and still allows flexibility in key names.
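
A minimal sketch of that rule, assuming a plain dict as the config. `pick_spelling` is a hypothetical helper written for this comment, not necessarily what the final implementation will look like.

```python
def pick_spelling(key, mapping):
    """Return the spelling of `key` to use when writing into `mapping`."""
    if key in mapping:
        return key
    # Reuse an equivalent spelling if one is already present.
    for alt in (key.replace("_", "-"), key.replace("-", "_")):
        if alt in mapping:
            return alt
    # Nothing equivalent exists: keep exactly what the user wrote.
    return key


config = {"x-y": {"a_b": 123}}

# "x_y" is equivalent to the existing "x-y", so the existing spelling wins.
config[pick_spelling("x_y", config)] = {"a_b": 456}

# "new_key" has no equivalent, so it is stored as given.
config[pick_spelling("new_key", config)] = 1

print(config)
```

Note that the nested value `{"a_b": 456}` is stored untouched; only the key being set is matched against existing spellings.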

jcrist · 2019-04-08T19:54:35Z

I hope to have some time to finish this up later this week.

djhoese · 2019-04-08T20:01:23Z

Ok so the config becomes "hyphen insensitive". Spending 30 seconds on thinking about it, does this mean that each get/set does a double lookup each time? I'm not worried about it performance-wise, but want to make sure.

I don't see any issues with this and glad something could be figured out.

jcrist · 2019-04-08T20:22:13Z

They do at most a double lookup (only if not found at the given name), but for internal accesses with standardized names they'll usually do a single lookup.
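
Roughly, a lookup could work like the sketch below. `get_insensitive` is a hypothetical name, and the dotted-path handling is an assumption for illustration; the point is only that the alternate spelling is tried when the given one is missing, so at most one extra spelling is checked per path component.

```python
def get_insensitive(config, dotted_key):
    """Look up a dotted key, trying the alternate hyphen/underscore spelling only on a miss."""
    result = config
    for part in dotted_key.split("."):
        if part not in result:
            # Miss on the given spelling: fall back to the other one.
            part = part.replace("_", "-") if "_" in part else part.replace("-", "_")
        result = result[part]  # raises KeyError if neither spelling exists
    return result


config = {"x-y": {"a_b": 123}}

print(get_insensitive(config, "x-y.a_b"))   # stored spellings: no fallback needed
print(get_insensitive(config, "x_y.a-b"))   # each component falls back once
```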

jcrist · 2019-04-26T21:02:48Z

Superseded by #4742. Closing.

jcrist mentioned this pull request Jan 25, 2019

Key normalization when reading config breaks logging configuration of Distributed #4366

Closed

ian-r-rose mentioned this pull request Apr 14, 2019

Configuration broken for kwargs that have underscores dask/dask-labextension#56

Closed

mrocklin added this to Proposed in Core maintenance Apr 17, 2019

jcrist mentioned this pull request Apr 26, 2019

Remove config key normalization #4742

Merged

jcrist closed this Apr 26, 2019

jcrist deleted the normalize-hyphen-to-underscore branch April 26, 2019 21:02

martindurant moved this from Proposed to Done in Core maintenance Apr 30, 2019
