Switch `_` -> `-` config normalization to `-` -> `_` #4422
Previously we'd normalize all key names in `dask.config` to hyphenated names, replacing all underscores with hyphens. This causes problems with keys that should be taken as literal values (e.g. environment variables in an `env` key). We really should only normalize key names that are explicitly declared (some kind of schema), but for now we just swap the normalized form since underscores are more common for environment variables.
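To make the problem concrete, here is a minimal sketch of blanket key normalization. It is not the actual `dask.config` code: `normalize_all` and the `job_settings` / `OMP_NUM_THREADS` names are hypothetical and only illustrate how recursing into every nested dict rewrites user data along with config keys.

```python
def normalize_all(obj):
    """Recursively replace underscores with hyphens in every dict key.

    Hypothetical helper: it cannot tell a nested config section apart
    from a dict that the user supplied as a value.
    """
    if isinstance(obj, dict):
        return {k.replace("_", "-"): normalize_all(v) for k, v in obj.items()}
    return obj


# Assumed example config: the "env" value is meant to be taken literally.
user_config = {"job_settings": {"env": {"OMP_NUM_THREADS": "1"}}}

normalized = normalize_all(user_config)
print(normalized)
# The environment variable name is mangled along with the real key:
# {'job-settings': {'env': {'OMP-NUM-THREADS': '1'}}}
```

Swapping the direction of the replacement only moves the problem: values containing hyphenated keys would then be rewritten instead.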
Looks like this causes distributed to fail to import, since distributed explicitly manipulates the It may be better to just avoid normalization for non-declared-keys. Thoughts @mrocklin, @djhoese? |
For the record, the way distributed is using the config (accessing the internal |
Ok so normalizing all keys just isn't going to work. Is that what we're seeing? |
Not unless we accept requiring distributed master to use dask master, as it would require patching distributed as well. |
Well distributed could be updated to use the config "module" (eventually will be an object) and that should fix this distributed error, right? So distributed could be released working for new and old versions of dask core (as much as it does now). Then dask core could be released with a dependence on the newest version of distributed. Right? Not great, but not completely ugly. |
Yes, distributed could be made compatible with old and new versions of dask, but dask won't be compatible with old versions of distributed. |
That seems like a lot of work to maintain backwards compatibility. I may give the only-normalize-declared-keys approach a try. |
We probably shouldn't manipulate the config dict directly. A fix to that library would be welcome.
We can also just release distributed |
@jcrist I think it would be about the same amount of work as normalizing only declared keys. The main advantage is that dask could then standardize key names as being underscore-separated only. That way environment variables and python keyword arguments could both be used for setting configuration options. Right now there are a few hyphenated keys (as mentioned in #4143) that are historical, but besides that there is no reason they need to be hyphenated or should be. At least that is my understanding. In the spirit of The Zen of Python, there should be one and only one way to do things. Allowing multiple forms of config keys sounds like asking for trouble (in my experience). As we've seen, it has caused some issues with the way it was implemented. However, I'm not a dask maintainer so feel free to disagree with me and point out what I'm missing. |
I agree that multiple ways to spell the same key here is unfortunate and confusing. If we were to standardize on underscores only we'd still need a long deprecation cycle for any user code that sets them using hyphens. I see this as separate from the main issue here, which is that key normalization can accidentally affect config values rather than config keys. If we switch to underscores then we still have the same issue as before, but in reverse (users can't explicitly set values that contain dicts with hyphens). While hyphens may be less common, we really should never be modifying the values a user assigned to a configuration key. Only applying normalization to known keys keeps the flexibility in spelling, and prevents the normalization from leaking into any nested data structures the user might set as their value. Explicit keys to normalize, rather than implicitly normalizing all of them. |
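A rough sketch of the "only normalize declared keys" idea, under stated assumptions: the schema format (a nested dict where `None` marks a leaf whose value is opaque), the function name `normalize_declared`, and the example keys are all hypothetical and not part of dask.

```python
def normalize_declared(config, schema):
    """Normalize underscores to hyphens only for keys listed in `schema`.

    `schema` is an assumed format: a nested dict of declared (hyphenated)
    key names, where a value of None marks a leaf whose value is opaque.
    """
    out = {}
    for key, value in config.items():
        norm = key.replace("_", "-")
        if norm in schema:
            sub = schema[norm]
            if isinstance(sub, dict) and isinstance(value, dict):
                # Declared nested section: keep normalizing inside it.
                out[norm] = normalize_declared(value, sub)
            else:
                # Declared leaf: normalize the key, never touch the value.
                out[norm] = value
        else:
            # Undeclared key: leave both key and value exactly as given.
            out[key] = value
    return out


schema = {"job-settings": {"env": None, "num-workers": None}}
config = {"job_settings": {"env": {"OMP_NUM_THREADS": "1"}, "num_workers": 4}}

result = normalize_declared(config, schema)
print(result)
# {'job-settings': {'env': {'OMP_NUM_THREADS': '1'}, 'num-workers': 4}}
```

Because recursion stops at declared leaves, the user's `env` dict passes through unchanged while both spellings of the declared keys are still accepted.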
To be clear, when you say "config values", you mean that someone is setting a dictionary as the value for a configuration key. Right now, dask doesn't have a way to know whether that dictionary is a nested configuration dictionary or a user dictionary value so it normalizes them, messing up peoples values. Right? I was suggesting two separate solutions.
All of that said, your solution of only normalizing specific keys does have the nice feature that you could stop analyzing sub-dictionaries because you know what is a value and what is a key, right? Hm, except once the legacy keys are fully deprecated in my suggestion then they don't have to be recursed at all. |
I think this is maybe the best solution, but don't know how the other dask-maintainers (e.g. @mrocklin) would feel about this. My proposal above avoids making this decision at the expense of more complicated code in |
What I was saying before is that I'm pretty sure your proposal will work almost the same as mine, just a different set of keys that should be normalized. It would probably require researching for all hyphenated keys though. |
@mrocklin, would you be fine deprecating (with a substantial deprecation cycle) all the hyphenated names in favor of underscores only? |
I have an aesthetic preference for hyphens over underscores in yaml config (and generally). I'll admit that I haven't followed the conversation in this issue entirely, is that what you're suggesting? |
The options proposed are:
I agree with @djhoese that option 1 is probably the simplest and cleanest. Option 2 has more "aesthetic" config names, but would result in more complicated (and potentially fragile) code. |
Normalization only seemed to be necessary when using environment variables. I wonder if we might remove normalization generally, but continue to handle it for the environment variable case by checking for the presence of a hyphenated name. In the Searching for the |
Are you saying when setting an environment variable, check for a hyphenated version first to set it as, and fall back to an underscored version? So If so, that won't work in the following scenario:
I really think we need to either:
We could stick with hyphens if we enforced hyphenated names everywhere, and consistently normalized environment variables as |
Good point about the first import time case. Not surprising I suppose that
magic gets in the way of magic.
Still thinking
On Fri, Jan 25, 2019 at 5:31 PM Jim Crist wrote:
I wonder if we might remove normalization generally, but continue to handle it for the environment variable case by checking for the presence of a hyphenated name.
Are you saying when setting an environment variable, check for a hyphenated version first to set it as, and fall back to an underscored version? So `DASK_FOO_BAR` would be stored as `foo-bar` if that key existed already, otherwise `foo_bar`?
If so, that won't work in the following scenario:
- User sets `DASK_FOO__BAR_BAZ=1` in the environment
- User imports dask, configuration is loaded from config files and `dask.yaml` default
- Config sees `foo.bar-baz` doesn't exist, stores it as `foo.bar_baz`
- User then imports `dask_foo` package for the *first time*. Since it hadn't been imported yet, `foo.yaml` didn't exist, so none of the `foo` config values were loaded initially
- Upon import, the `dask_foo` package sets `foo.bar-baz` as the default, since the environment variable was stored as `foo.bar_baz` above.
I really think we need to either:
- Always normalize keys and ensure we only normalize keys that are actually keys (through explicit registration)
- Never normalize keys and develop a consistent naming pattern that works for both (e.g. dask config names are always underscores or hyphens)
We could stick with hyphens if we enforced hyphenated names everywhere, and consistently normalized environment variables as `envvar.lower().replace('__', '.').replace('_', '-')`
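The environment-variable rule quoted above can be sketched as a small function. The replace chain is taken from the comment; stripping a `DASK_` prefix first and the name `env_to_key` are assumptions made for illustration.

```python
def env_to_key(name, prefix="DASK_"):
    """Map an environment variable name to a hyphenated config key.

    Assumed convention: drop the prefix, then a double underscore
    separates nesting levels and a single underscore becomes a hyphen.
    """
    if not name.startswith(prefix):
        raise ValueError(f"{name!r} does not start with {prefix!r}")
    envvar = name[len(prefix):]
    # Order matters: '__' must be replaced before single '_'.
    return envvar.lower().replace("__", ".").replace("_", "-")


key = env_to_key("DASK_FOO__BAR_BAZ")
print(key)  # foo.bar-baz
```

With a fixed rule like this the result no longer depends on which keys happen to exist at import time, which is what breaks the "check for a hyphenated name first" approach in the scenario listed above.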
Thoughts on this ^^? No magic, still get hyphenated names, we just enforce them now. |
What about when someone does something like |
I forgot we also supported that syntax (although it doesn't look like we handle nested attributes that way e.g. Really we just want to avoid any normalization for the following input forms:
One argument for underscores instead of hyphens is that we'd only need to replace |
Any more thoughts here @mrocklin? |
@jcrist sorry for the long delay here. Looking back on this issue with a fresh mind I find that I'm still not totally sure what is being proposed here. It sounds like one suggestion is the following:
That does seem consistent though I'll admit that I'll be sad to lose hyphens (which I find easier to type and somewhat more modern, though these are both subjective) and also I dread somewhat the config mismatches we'll get into. This would have to be a long deprecation cycle due to the presence of files on hard drives. Folks have been talking about a set of "registered" names. Just as an FYI, any imported module will have its default configuration placed in |
What happens if we just turn normalization off and leave everything else as-is? (This may already have been proposed, sorry) Presumably the situation in #4141 returns. Maybe that's ok and we just say "use files, not environment variables"? |
Yeah, opposite opinion, especially filenames. But opinion nonetheless.
As someone who has users using environment variables a lot, I'd be sad to have this happen. This one especially (num_workers) is a pretty common one for people to want to change. Environment variables give easy access to "quick" changes and are used pretty often by some groups I've worked with, especially when bash scripts are common. If this is done I don't see why there couldn't also be a deprecation cycle for the existing keys...to an underscore standard. Darn, that was the main suggestion. It would be really nice to not have to clarify the config usage with "any config value can be specified in python, by environment variable, or in a YAML config except for in case X, Y, Z where ... is not possible". |
@jcrist and I sat down and spoke about this last week. We came to an agreement where we don't normalize any values that go into the config, but we do teach get/set to treat them equivalently:
dask.config.set({'x-y': {'a_b': 123}})
config = {
'x-y': {'a_b': 123}
}
>>> dask.config.get('x_y')
{'a_b': 123}
dask.config.set(x_y={'a_b': 456})
config = {
'x-y': {'a_b': 456}
} |
To clarify this further. If an equivalent value already exists in the mapping, we use that spelling (that's why |
I hope to have some time to finish this up later this week. |
Ok so the config becomes "hyphen insensitive". Spending 30 seconds on thinking about it, does this mean that each get/set does a double lookup each time? I'm not worried about it performance-wise, but want to make sure. I don't see any issues with this and am glad something could be figured out. |
They do at most a double lookup (only if not found at the given name), but for internal accesses with standardized names they'll usually do a single lookup. |
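The agreed behaviour can be sketched as follows. This is an illustration, not the code that landed in dask: the helper names `canonical_name`, `config_set`, and `config_get` and the plain-dict config are assumptions. Stored keys and values are never rewritten; only the lookup tries the alternate spelling.

```python
def canonical_name(key, mapping):
    """Return the spelling of `key` that already exists in `mapping`.

    At most two membership checks: the key as given, then the same key
    with hyphens and underscores swapped. Falls back to the given key.
    """
    if key in mapping:
        return key
    alt = key.replace("_", "-") if "_" in key else key.replace("-", "_")
    if alt in mapping:
        return alt
    return key


def config_set(config, key, value):
    # Reuse an existing spelling if there is one; store the value untouched.
    config[canonical_name(key, config)] = value


def config_get(config, key):
    return config[canonical_name(key, config)]


config = {}
config_set(config, "x-y", {"a_b": 123})
config_set(config, "x_y", {"a_b": 456})  # updates the existing 'x-y' entry

print(config)                     # {'x-y': {'a_b': 456}}
print(config_get(config, "x_y"))  # {'a_b': 456}
```

A lookup with the spelling that is already stored succeeds on the first check, which matches the point above that internal accesses with standardized names usually need a single lookup.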
Superseded by #4742. Closing. |
flake8 dask
Fixes #4366.