Try to set LC_CTYPE to something UTF-8 capable #8031

faho · 2021-05-25T17:51:25Z

When fish is started with LC_CTYPE=C (even just effectively, often via
LC_ALL=C!), it's basically broken. There's no way to handle non-ASCII
characters with a C locale unless we want to write our
locale-independent replacements for all of the system functions.

Since we're not going to do that, let's try to find some locale for
LC_CTYPE.

We already do that in __fish_setlocale, but that's

a bit of a weird thing that reads unstandardized system
configuration files
allows setting locale to C explicitly

So it's still easily possible to end up in a broken configuration.

Now, the issue with this is that there is (AFAICT) no portable way to
get a list of all allowed locales and C.UTF-8 is not standardized, so
we have no one locale to fall back on and are forced to try a few. The
list we have here is quite arbitrary, but it's a start.

Python does something similar and only tries C.UTF-8, C.utf8 and
"UTF-8".

Once C.UTF-8 is (hopefully) standardized, that will just start
working (tm).

Note that we do not export the fixed LC_CTYPE variable, so external
programs still have to deal with the C locale, but we have no real
business messing with the user's environment.

TODOs:

Changes to fish usage are reflected in user documentation/manpages.
Tests have been added for regressions fixed
User-visible changes noted in CHANGELOG.rst

When fish is started with LC_CTYPE=C (even just effectively, often via LC_ALL=C!), it's basically broken. There's no way to handle non-ASCII characters with a C locale unless we want to write our locale-independent replacements for all of the system functions. Since we're not going to do that, let's try to find *some locale* for LC_CTYPE. We already do that in __fish_setlocale, but that's - a bit of a weird thing that reads unstandardized system configuration files - allows setting locale to C explicitly So it's still easily possible to end up in a broken configuration. Now, the issue with this is that there is (AFAICT) no portable way to get a list of all allowed locales and C.UTF-8 is not standardized, so we have no one locale to fall back on and are forced to try a few. The list we have here is quite arbitrary, but it's a start. Python does something similar and only tries C.UTF-8, C.utf8 and "UTF-8". Once C.UTF-8 is (hopefully) standardized, that will just start working (tm). Note that we do not *export* the fixed LC_CTYPE variable, so external programs still have to deal with the C locale, but we have no real business messing with the user's environment.

faho · 2021-05-25T17:59:01Z

Oh, also importantly, there is no opt-out of this, because I don't think there is anything that is fixed by having fish use a singlebyte locale. It's possible I'm wrong there.

ridiculousfish

Please relnote this of course.

init_locale runs at startup, and also when any of the locale variables change. Do we want to do this even if the user modifies LC_ALL via a set command? If not we could add a param to init_locale indicating if it's being run at startup or not.

src/env_dispatch.cpp

faho · 2021-05-26T06:59:45Z

init_locale runs at startup, and also when any of the locale variables change. Do we want to do this even if the user modifies LC_ALL via a set command?

The idea here is that an LC_CTYPE of C is broken.

LC_MESSAGES is a matter of which language you speak - do you want your messages in slovenian or chinese? LC_COLLATE is also a matter of regional preference - do you sort "Ä" after "Z" or equivalent to "A"?

By contrast, LC_CTYPE isn't a preference - if you set it to C, we can strictly work with less, and then providing non-ASCII input doesn't work. This is an issue because people tend to default to "LC_ALL=C" as a simple "reset my locale", and then continue providing emoji and such, even in filenames, and we can't work with them because functions like iswalnum just break. E.g. in UTF-8 mode you can set a variable called ö, and in ASCII mode you can't because it doesn't detect "ö" as a valid variable name character!

(this is of course assuming that there's two encodings worth caring about - 8-bit ASCII and UTF-8, and one is a strict subset of the other. On unix that's broadly true and there isn't really a good argument for anything else. I've also not seen anyone work with anything else unless it was a misconfiguration)

So, if a C locale is broken, the logical thing to do is to try to avoid it, even if the user indicated otherwise - and frequently they didn't, actually! They just set LC_ALL=C because that's the reset button. Some bug trackers even suggest reproducing with that to get messages in english, because they can't rely on en_US or en_GB or other variants being available.

faho · 2021-05-26T07:03:49Z

Also compare python's PEP 538, which does something quite similar to this.

krobelus · 2021-05-29T00:34:40Z

Great idea! This means that fish should work in a container even without setting a multibyte locale.
Python only does this when the locale is the default C, but this seems fine as well.

faho · 2021-05-30T10:40:50Z

I'm not entirely sure what to do about __fish_set_locale. It's slow and a bit of a hack (we read a bunch of config files we have no business reading), but it does a few things this doesn't, ATM:

It gives a full locale, with language and everything, in case that's passed wrong (or removed like that awful getty hack)
It gives a proper LC_CTYPE in case none from the list exist, e.g. if you only have fr_FR.UTF-8 built (and no C.UTF-8)
It actually sets the variables instead of just fixing fish

So I'm not sure if we should remove it or keep it. We could extend the list of locales this PR uses to everything glibc provides, which should fix it, to a first approximation, everywhere (but just for fish). Removing __fish_set_locale would be faster and allow us to add a warning in case of a misconfigured and unfixable locale (e.g. LC_ALL set to C, but no C.UTF-8 or anything available).

$fish_allow_singlebyte_locale, if set to something true (like "1"), will re-run the locale initialization and skip the bit where we force LC_CTYPE to be utf8-capable. This is mainly used in our tests, but might also be useful if people are trying to do something weird.

faho · 2021-06-06T07:27:43Z

Alright, I have now added a way to turn it off - setting $fish_allow_singlebyte_locale to 1 (strictly speaking something "bool_from_string" interprets as true).

This is mostly so we can actually test a C locale.

faho added enhancement RFC labels May 25, 2021

faho added this to the fish 3.3.0 milestone May 25, 2021

ridiculousfish approved these changes May 25, 2021

View reviewed changes

src/env_dispatch.cpp Outdated Show resolved Hide resolved

Use char* instead of std::string

bb9e31e

faho added 2 commits June 6, 2021 08:48

Document locale coercion

2068745

faho merged commit 046db09 into fish-shell:master Jun 6, 2021

faho deleted the fix-locale-harder branch July 29, 2021 15:57

faho mentioned this pull request Dec 18, 2021

Locale fallback doesn't consider Debian's /etc/default/locale #8557

Closed

github-actions bot locked as resolved and limited conversation to collaborators Aug 1, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Try to set LC_CTYPE to something UTF-8 capable #8031

Try to set LC_CTYPE to something UTF-8 capable #8031

faho commented May 25, 2021

faho commented May 25, 2021

ridiculousfish left a comment

faho commented May 26, 2021 •

edited

faho commented May 26, 2021

krobelus commented May 29, 2021 •

edited

faho commented May 30, 2021 •

edited

faho commented Jun 6, 2021

Try to set LC_CTYPE to something UTF-8 capable #8031

Try to set LC_CTYPE to something UTF-8 capable #8031

Conversation

faho commented May 25, 2021

TODOs:

faho commented May 25, 2021

ridiculousfish left a comment

Choose a reason for hiding this comment

faho commented May 26, 2021 • edited

faho commented May 26, 2021

krobelus commented May 29, 2021 • edited

faho commented May 30, 2021 • edited

faho commented Jun 6, 2021

faho commented May 26, 2021 •

edited

krobelus commented May 29, 2021 •

edited

faho commented May 30, 2021 •

edited