New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Try to set LC_CTYPE to something UTF-8 capable #8031
Conversation
When fish is started with LC_CTYPE=C (even just effectively, often via LC_ALL=C!), it's basically broken. There's no way to handle non-ASCII characters with a C locale unless we want to write our locale-independent replacements for all of the system functions. Since we're not going to do that, let's try to find *some locale* for LC_CTYPE. We already do that in __fish_setlocale, but that's - a bit of a weird thing that reads unstandardized system configuration files - allows setting locale to C explicitly So it's still easily possible to end up in a broken configuration. Now, the issue with this is that there is (AFAICT) no portable way to get a list of all allowed locales and C.UTF-8 is not standardized, so we have no one locale to fall back on and are forced to try a few. The list we have here is quite arbitrary, but it's a start. Python does something similar and only tries C.UTF-8, C.utf8 and "UTF-8". Once C.UTF-8 is (hopefully) standardized, that will just start working (tm). Note that we do not *export* the fixed LC_CTYPE variable, so external programs still have to deal with the C locale, but we have no real business messing with the user's environment.
Oh, also importantly, there is no opt-out of this, because I don't think there is anything that is fixed by having fish use a singlebyte locale. It's possible I'm wrong there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please relnote this of course.
init_locale
runs at startup, and also when any of the locale variables change. Do we want to do this even if the user modifies LC_ALL
via a set
command? If not we could add a param to init_locale
indicating if it's being run at startup or not.
The idea here is that an LC_CTYPE of C is broken. LC_MESSAGES is a matter of which language you speak - do you want your messages in slovenian or chinese? LC_COLLATE is also a matter of regional preference - do you sort "Ä" after "Z" or equivalent to "A"? By contrast, LC_CTYPE isn't a preference - if you set it to C, we can strictly work with less, and then providing non-ASCII input doesn't work. This is an issue because people tend to default to "LC_ALL=C" as a simple "reset my locale", and then continue providing emoji and such, even in filenames, and we can't work with them because functions like iswalnum just break. E.g. in UTF-8 mode you can set a variable called (this is of course assuming that there's two encodings worth caring about - 8-bit ASCII and UTF-8, and one is a strict subset of the other. On unix that's broadly true and there isn't really a good argument for anything else. I've also not seen anyone work with anything else unless it was a misconfiguration) So, if a C locale is broken, the logical thing to do is to try to avoid it, even if the user indicated otherwise - and frequently they didn't, actually! They just set LC_ALL=C because that's the reset button. Some bug trackers even suggest reproducing with that to get messages in english, because they can't rely on en_US or en_GB or other variants being available. |
Also compare python's PEP 538, which does something quite similar to this. |
Great idea! This means that fish should work in a container even without setting a multibyte locale. |
I'm not entirely sure what to do about __fish_set_locale. It's slow and a bit of a hack (we read a bunch of config files we have no business reading), but it does a few things this doesn't, ATM:
So I'm not sure if we should remove it or keep it. We could extend the list of locales this PR uses to everything glibc provides, which should fix it, to a first approximation, everywhere (but just for fish). Removing __fish_set_locale would be faster and allow us to add a warning in case of a misconfigured and unfixable locale (e.g. LC_ALL set to C, but no C.UTF-8 or anything available). |
$fish_allow_singlebyte_locale, if set to something true (like "1"), will re-run the locale initialization and skip the bit where we force LC_CTYPE to be utf8-capable. This is mainly used in our tests, but might also be useful if people are trying to do something weird.
Alright, I have now added a way to turn it off - setting $fish_allow_singlebyte_locale to 1 (strictly speaking something "bool_from_string" interprets as true). This is mostly so we can actually test a C locale. |
When fish is started with LC_CTYPE=C (even just effectively, often via
LC_ALL=C!), it's basically broken. There's no way to handle non-ASCII
characters with a C locale unless we want to write our
locale-independent replacements for all of the system functions.
Since we're not going to do that, let's try to find some locale for
LC_CTYPE.
We already do that in __fish_setlocale, but that's
configuration files
So it's still easily possible to end up in a broken configuration.
Now, the issue with this is that there is (AFAICT) no portable way to
get a list of all allowed locales and C.UTF-8 is not standardized, so
we have no one locale to fall back on and are forced to try a few. The
list we have here is quite arbitrary, but it's a start.
Python does something similar and only tries C.UTF-8, C.utf8 and
"UTF-8".
Once C.UTF-8 is (hopefully) standardized, that will just start
working (tm).
Note that we do not export the fixed LC_CTYPE variable, so external
programs still have to deal with the C locale, but we have no real
business messing with the user's environment.
TODOs: