Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch _ -> - config normalization to - -> _ #4422

Closed
wants to merge 1 commit into from

Conversation

jcrist
Copy link
Member

@jcrist jcrist commented Jan 25, 2019

Previously we'd normalize all key names in dask.config to hyphenated
names, replacing all underscores with hyphens. This causes problems with
keys that should be taken as literal values (e.g. environment variables
in an env key). We really should only normalize key names that are
explicitly declared (some kind of schema), but for now we just swap the
normlalized form since underscores are more common for environment
variables.

  • Tests added / passed
  • Passes flake8 dask

Fixes #4366.

Previously we'd normalize all key names in `dask.config` to hyphenated
names, replacing all underscores with hyphens. This causes problems with
keys that should be taken as literal values (e.g. environment variables
in an `env` key). We really should only normalize key names that are
explicitly declared (some kind of schema), but for now we just swap the
normlalized form since underscores are more common for environment
variables.
@jcrist
Copy link
Member Author

jcrist commented Jan 25, 2019

Looks like this causes distributed to fail to import, since distributed explicitly manipulates the dask.config.config dict directly. I could patch distributed, but then dask master would require distributed master, which is a bit restrictive.

It may be better to just avoid normalization for non-declared-keys. Thoughts @mrocklin, @djhoese?

@djhoese
Copy link
Contributor

djhoese commented Jan 25, 2019

For the record, the way distributed is using the config (accessing the internal dict directly) is undesired in the long run, but I haven't thought of a good way to transition it. Let me look at the test failures now...

@djhoese
Copy link
Contributor

djhoese commented Jan 25, 2019

Ok so normalizing all keys just isn't going to work. Is that what we're seeing?

@jcrist
Copy link
Member Author

jcrist commented Jan 25, 2019

Not unless we accept requiring distributed master to use dask master, as it would require patching distributed as well.

@djhoese
Copy link
Contributor

djhoese commented Jan 25, 2019

Well distributed could be updated to use the config "module" (eventually will be an object) and that should fix this distributed error, right? So distributed could be released working for new and old versions of dask core (as much as it does now). Then dask core could be released with a dependence on the newest version of distributed. Right?

Not great, but not completely ugly.

@jcrist
Copy link
Member Author

jcrist commented Jan 25, 2019

Yes, distributed could be made compatible with old and new versions of dask, but dask won't be compatible with old versions of distributed.

@djhoese
Copy link
Contributor

djhoese commented Jan 25, 2019

Or...can key normalization be turned off except for specific keys sort of as a deprecation cycle. So basically no config keys with hyphens in them (as discussed by me and @mrocklin in #4143 which was a response to #4141) so they can be set by environment variables.

@jcrist
Copy link
Member Author

jcrist commented Jan 25, 2019

That seems like a lot of work to maintain backwards compatibility. I may give the only-normalize-declared-keys approach a try.

@mrocklin
Copy link
Member

We probably shouldn't manipulate the config dict directly. A fix to that library would be welcome.

but then dask master would require distributed master, which is a bit restrictive

We can also just release distributed

@djhoese
Copy link
Contributor

djhoese commented Jan 25, 2019

@jcrist I think it would be about the same amount of work as normalizing only declared keys. The main advantage is that dask could then standardize key names as being underscore-separated only. That way environment variables and python keyword arguments could both be used for setting configuration options. Right now there are a few hyphenated keys (as mentioned in #4143) that are historical, but besides that there is no reason they need to be hyphenated or should be. At least that is my understanding.

In the spirit of The Zen of Python, there should one and only one way to do things. Allowing multiple forms of config keys sounds like asking for trouble (in my experience). As we've seen it has caused some issues with the way it was implemented. However, I'm not a dask maintainer so feel free to disagree with me and point out what I'm missing.

@jcrist
Copy link
Member Author

jcrist commented Jan 25, 2019

The main advantage is that dask could then standardize key names as being underscore-separated only.
...
Allowing multiple forms of config keys sounds like asking for trouble (in my experience).

I agree that multiple ways to spell the same key here is unfortunate and confusing. If we were to standardize on underscores only we'd still need a long deprecation cycle for any user code that sets them using hyphens. I see this as separate than the main issue here, which is that key normalization can accidentally affect config values rather than config keys. If we switch to underscores then we still have the same issue as before, but in reverse (users can't explicitly set values that contain dicts with hyphens). While hyphens may be less common, we really should never be modifying the values a user assigned to a configuration key.

Only applying normalization to known keys keeps the flexibility in spelling, and prevents the normalization from leaking into any nested datastructures the user might set as their value. Explicit keys to normalize, rather than implicitly normalizing all of them.

@djhoese
Copy link
Contributor

djhoese commented Jan 25, 2019

To be clear, when you say "config values", you mean that someone is setting a dictionary as the value for a configuration key. Right now, dask doesn't have a way to know whether that dictionary is a nested configuration dictionary or a user dictionary value so it normalizes them, messing up peoples values. Right?

I was suggesting two separate solutions.

  1. One is that we switch to underscores as the internal representation for config keys. As you said this gives us the same problem but is less likely to mess things up. Still not preferred.

  2. Only normalize a specific set of keys (the historical ones with hyphens in them) and "standardize" any future dask keys (any new ones added by dask-core or dask subpackages) as requiring underscore separation and no hyphens. This underscore-only naming would be the convention, not a normalization, and wouldn't be handled specially in the config code.

    Yes, this would be a long deprecation cycle but it (normalizing legacy key names) never has to change or be updated until it is removed (fully deprecated). This brings dask towards a final end goal of underscore-only config keys by convention.

All of that said, your solution of only normalizing specific keys does have the nice feature that you could stop analyzing sub-dictionaries because you know what is a value and what is a key, right? Hm, except once the legacy keys are fully deprecated in my suggestion then they don't have to be recursed at all.

@jcrist
Copy link
Member Author

jcrist commented Jan 25, 2019

Only normalize a specific set of keys (the historical ones with hyphens in them) and "standardize" any future dask keys (any new ones added by dask-core or dask subpackages) as requiring underscore separation and no hyphens.

I think this is maybe the best solution, but don't know how the other dask-maintainers (e.g. @mrocklin) would feel about this. My proposal above avoids making this decision at the expense of more complicated code in dask.config.

@djhoese
Copy link
Contributor

djhoese commented Jan 25, 2019

What I was saying before is that I'm pretty sure your proposal will work almost the same as mine, just a different set of keys that should be normalized. It would probably require researching for all hyphenated keys though.

@jcrist
Copy link
Member Author

jcrist commented Jan 25, 2019

@mrocklin, would you be fine deprecating (with a substantial deprecation cycle) all the hyphenated names in favor of underscores only?

@mrocklin
Copy link
Member

I have an aesthetic preference for hyphens over underscores in yaml config (and generally). I'll admit that I haven't followed the conversation in this issue entirely, is that what you're suggesting?

@jcrist
Copy link
Member Author

jcrist commented Jan 25, 2019

The options proposed are:

  • Standardize on underscore names everywhere (with a deprecation cycle), and remove the normalization code. Underscore names work as environment variables, python kwargs, and yaml keys, so there's no need to normalize, and removing the normalization process prevents accidentally changing keys that really are keys in a dict value (like yarn.worker.env).

  • Only normalize a set of "registered" keys that dask cares about. This continues to allow underscores and hyphens everywhere, and avoids accidentally applying the normalization code to dict values (like yarn.worker.env). This would require all modules to register what keys they're looking for, but that could be done by automatically looking at the default files they distribute.

I agree with @djhoese that option 1 is probably the simplest and cleanest. Option 2 has more "aesthetic" config names, but would result in more complicated (and potentially fragile) code.

@mrocklin
Copy link
Member

Normalization only seemed to be necessary when using environment variables.

I wonder if we might remove normalization generally, but continue to handle it for the environment variable case by checking for the presence of a hyphenated name. In the DASK_NUM_WORKERS case (the original reason for normalization in #4141) we would check to see if either num_workers or num-workers was in the config and, if so, use it. Otherwise we would default to underscores. Other than that no normalization would take place.

Searching for the -/_ difference might get a little complex with nested config values DASK_FOO__BAR__BAZ_QUUX, but might not be too bad. I think that this would probably reduce magic overall.

@jcrist
Copy link
Member Author

jcrist commented Jan 26, 2019

I wonder if we might remove normalization generally, but continue to handle it for the environment variable case by checking for the presence of a hyphenated name.

Are you saying when setting an environment variable, check for a hyphenated version first to set it as, and fall back to an underscored version? So DASK_FOO_BAR would be stored as foo-bar if that key existed already, otherwise foo_bar?

If so, that won't work in the following scenario:

  • User sets DASK_FOO__BAR_BAZ=1 in the environment
  • User imports dask, configuration is loaded from config files and dask.yaml default
  • Config sees foo.bar-baz doesn't exist, stores it as foo.bar_baz
  • User then imports dask_foo package for the first time. Since it hadn't been imported yet, foo.yaml didn't exist, so none of the foo config values were loaded initially
  • Upon import, the dask_foo package sets foo.bar-baz as the default, since the environment variable was stored as foo.bar_baz above.

I really think we need to either:

  • Always normalize keys and ensure we only normalize keys that are actually keys (through explicit registration)
  • Never normalize keys and develop a consistent naming pattern that works for both (e.g. dask config names are always underscores or hyphens)

We could stick with hyphens if we enforced hyphenated names everywhere, and consistently normalized environment variables as envvar.lower().replace('__', '.').replace('_', '-')

@mrocklin
Copy link
Member

mrocklin commented Jan 26, 2019 via email

@jcrist
Copy link
Member Author

jcrist commented Jan 26, 2019

We could stick with hyphens if we enforced hyphenated names everywhere, and consistently normalized environment variables as envvar.lower().replace('__', '.').replace('_', '-')

Thoughts on this ^^? No magic, still get hyphenated names, we just enforce them now.

@mrocklin
Copy link
Member

What about when someone does something like dask.config.set(num_workers=10) ? Do we auto-normalize to num-workers?

@jcrist
Copy link
Member Author

jcrist commented Jan 27, 2019

I forgot we also supported that syntax (although it doesn't look like we handle nested attributes that way e.g. yarn__worker__env={'foo_bar': 'baz'}). I'd say yes to then as well.

Really we just want to avoid any normalization for the following input forms:

  • keys in yaml files, as keys and dict values look the same
yarn:
  worker:
    env:
      foo_bar: baz
  • keys in dicts, as keys and dict values look the same
dask.config.set({'yarn': {'worker': {'env': {'foo_bar': 'baz'}}}})

One arguments for underscores instead of hyphens is that we'd only need to replace __ with . for the envar & keyword inputs, everything else would work as is.

@jcrist
Copy link
Member Author

jcrist commented Feb 19, 2019

Any more thoughts here @mrocklin?

@mrocklin
Copy link
Member

@jcrist sorry for the long delay here. Looking back on this issue with a fresh mind I find that I'm still not totally sure what is being proposed here.

It sounds like one suggestion is the following:

Standardize on underscore names everywhere (with a deprecation cycle), and remove the normalization code. Underscore names work as environment variables, python kwargs, and yaml keys, so there's no need to normalize, and removing the normalization process prevents accidentally changing keys that really are keys in a dict value (like yarn.worker.env).

That does seem consistent though I'll admit that I'll be sad to lose hyphens (which I find easier to type and somewhat more modern, though these are both subjective) and also I dread somewhat the config mismatches we'll get into. This would have to be a long deprecation cycle due to the presence of files on hard drives.

Folks have been talking about a set of "registered" names. Just as an FYI, any imported module will have its default configuration placed in dask.config.defaults, which may serve as such a registry. This only happens after import though, so isn't foolproof. I only bring this up as potential ammunition, I have no particular thought on how this can be used.

@mrocklin
Copy link
Member

What happens if we just turn normalization off and leave everything else as-is? (This may already have been proposed, sorry)

Presumably the situation in #4141 returns. Maybe that's ok and we just say "use files, not environment variables"?

@djhoese
Copy link
Contributor

djhoese commented Feb 20, 2019

I'll be sad to lose hyphens (which I find easier to type and somewhat more modern, though these are both subjective)

Yeah, opposite opinion, especially filenames. But opinion none the less.

Presumably the situation in #4141 returns. Maybe that's ok and we just say "use files, not environment variables"?

As someone who has users using environment variables a lot, I'd be sad to have this happen. This one especially (num_workers) is a pretty common one for people to want to change. Environment variables give easy access to "quick" changes and are used pretty often by some groups I've worked with, especially when bash scripts are common. If this is done I don't see why there couldn't also be a deprecation cycle for the existing keys...to an underscore standard. Darn, that was the main suggestion.

It would be really nice to not have to clarify the config usage with "any config value can be specified in python, by environment variable, or in a YAML config except for in case X, Y, Z where ... is not possible".

@mrocklin
Copy link
Member

mrocklin commented Apr 8, 2019

@jcrist and I sat down at spoke about thi last week. We came to an agreement where we don't normalize any values that go into the config, but we do teach get/set to treat them equivalently

dask.config.set({'x-y': {'a_b': 123}})

config = {
    'x-y': {'a_b': 123}
}

>>> dask.config.get('x_y')
{'a_b': 123}


dask.config.set(x_y={'a_b': 456})

config = {
    'x-y': {'a_b': 456}
}

@jcrist
Copy link
Member Author

jcrist commented Apr 8, 2019

To clarify this further. If an equivalent value already exists in the mapping, we use that spelling (that's why x_y follows x-y above, since their equivalent). If a value doesn't already exist, we use the spelling exactly as provided by the user. This prevents unnecessary normalization, and still allows flexibility in key names.

@jcrist
Copy link
Member Author

jcrist commented Apr 8, 2019

I hope to have some time to finish this up later this week.

@djhoese
Copy link
Contributor

djhoese commented Apr 8, 2019

Ok so the config because "hyphen insensitive". Spending 30 seconds on thinking about it, does this mean that each get/set does a double lookup each time? I'm not worried about it performance-wise, but want to make sure.

I don't see any issues with this and glad something could be figured out.

@jcrist
Copy link
Member Author

jcrist commented Apr 8, 2019

They do at most a double lookup (only if not found at the given name), but for internal accesses with standardized names they'll usually do a single lookup.

@jcrist
Copy link
Member Author

jcrist commented Apr 26, 2019

Superseded by #4742. Closing.

@jcrist jcrist closed this Apr 26, 2019
@jcrist jcrist deleted the normalize-hyphen-to-underscore branch April 26, 2019 21:02
@martindurant martindurant moved this from Proposed to Done in Core maintenance Apr 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

Key normalization when reading config breaks logging configuration of Distributed
3 participants