Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try to set LC_CTYPE to something UTF-8 capable #8031

Merged
merged 4 commits into from Jun 6, 2021

Conversation

faho
Copy link
Member

@faho faho commented May 25, 2021

When fish is started with LC_CTYPE=C (even just effectively, often via
LC_ALL=C!), it's basically broken. There's no way to handle non-ASCII
characters with a C locale unless we want to write our
locale-independent replacements for all of the system functions.

Since we're not going to do that, let's try to find some locale for
LC_CTYPE.

We already do that in __fish_setlocale, but that's

  • a bit of a weird thing that reads unstandardized system
    configuration files
  • allows setting locale to C explicitly

So it's still easily possible to end up in a broken configuration.

Now, the issue with this is that there is (AFAICT) no portable way to
get a list of all allowed locales and C.UTF-8 is not standardized, so
we have no one locale to fall back on and are forced to try a few. The
list we have here is quite arbitrary, but it's a start.

Python does something similar and only tries C.UTF-8, C.utf8 and
"UTF-8".

Once C.UTF-8 is (hopefully) standardized, that will just start
working (tm).

Note that we do not export the fixed LC_CTYPE variable, so external
programs still have to deal with the C locale, but we have no real
business messing with the user's environment.

TODOs:

  • Changes to fish usage are reflected in user documentation/manpages.
  • Tests have been added for regressions fixed
  • User-visible changes noted in CHANGELOG.rst

When fish is started with LC_CTYPE=C (even just effectively, often via
LC_ALL=C!), it's basically broken. There's no way to handle non-ASCII
characters with a C locale unless we want to write our
locale-independent replacements for all of the system functions.

Since we're not going to do that, let's try to find *some locale* for
LC_CTYPE.

We already do that in __fish_setlocale, but that's

- a bit of a weird thing that reads unstandardized system
  configuration files
- allows setting locale to C explicitly

So it's still easily possible to end up in a broken configuration.

Now, the issue with this is that there is (AFAICT) no portable way to
get a list of all allowed locales and C.UTF-8 is not standardized, so
we have no one locale to fall back on and are forced to try a few. The
list we have here is quite arbitrary, but it's a start.

Python does something similar and only tries C.UTF-8, C.utf8 and
"UTF-8".

Once C.UTF-8 is (hopefully) standardized, that will just start
working (tm).

Note that we do not *export* the fixed LC_CTYPE variable, so external
programs still have to deal with the C locale, but we have no real
business messing with the user's environment.
@faho faho added this to the fish 3.3.0 milestone May 25, 2021
@faho
Copy link
Member Author

faho commented May 25, 2021

Oh, also importantly, there is no opt-out of this, because I don't think there is anything that is fixed by having fish use a singlebyte locale. It's possible I'm wrong there.

Copy link
Member

@ridiculousfish ridiculousfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please relnote this of course.

init_locale runs at startup, and also when any of the locale variables change. Do we want to do this even if the user modifies LC_ALL via a set command? If not we could add a param to init_locale indicating if it's being run at startup or not.

src/env_dispatch.cpp Outdated Show resolved Hide resolved
@faho
Copy link
Member Author

faho commented May 26, 2021

init_locale runs at startup, and also when any of the locale variables change. Do we want to do this even if the user modifies LC_ALL via a set command?

The idea here is that an LC_CTYPE of C is broken.

LC_MESSAGES is a matter of which language you speak - do you want your messages in slovenian or chinese? LC_COLLATE is also a matter of regional preference - do you sort "Ä" after "Z" or equivalent to "A"?

By contrast, LC_CTYPE isn't a preference - if you set it to C, we can strictly work with less, and then providing non-ASCII input doesn't work. This is an issue because people tend to default to "LC_ALL=C" as a simple "reset my locale", and then continue providing emoji and such, even in filenames, and we can't work with them because functions like iswalnum just break. E.g. in UTF-8 mode you can set a variable called ö, and in ASCII mode you can't because it doesn't detect "ö" as a valid variable name character!

(this is of course assuming that there's two encodings worth caring about - 8-bit ASCII and UTF-8, and one is a strict subset of the other. On unix that's broadly true and there isn't really a good argument for anything else. I've also not seen anyone work with anything else unless it was a misconfiguration)

So, if a C locale is broken, the logical thing to do is to try to avoid it, even if the user indicated otherwise - and frequently they didn't, actually! They just set LC_ALL=C because that's the reset button. Some bug trackers even suggest reproducing with that to get messages in english, because they can't rely on en_US or en_GB or other variants being available.

@faho
Copy link
Member Author

faho commented May 26, 2021

Also compare python's PEP 538, which does something quite similar to this.

@krobelus
Copy link
Member

krobelus commented May 29, 2021

Great idea! This means that fish should work in a container even without setting a multibyte locale.
Python only does this when the locale is the default C, but this seems fine as well.

@faho
Copy link
Member Author

faho commented May 30, 2021

I'm not entirely sure what to do about __fish_set_locale. It's slow and a bit of a hack (we read a bunch of config files we have no business reading), but it does a few things this doesn't, ATM:

  1. It gives a full locale, with language and everything, in case that's passed wrong (or removed like that awful getty hack)
  2. It gives a proper LC_CTYPE in case none from the list exist, e.g. if you only have fr_FR.UTF-8 built (and no C.UTF-8)
  3. It actually sets the variables instead of just fixing fish

So I'm not sure if we should remove it or keep it. We could extend the list of locales this PR uses to everything glibc provides, which should fix it, to a first approximation, everywhere (but just for fish). Removing __fish_set_locale would be faster and allow us to add a warning in case of a misconfigured and unfixable locale (e.g. LC_ALL set to C, but no C.UTF-8 or anything available).

faho added 2 commits June 6, 2021 08:48
$fish_allow_singlebyte_locale, if set to something true (like "1"),
will re-run the locale initialization and skip the bit where we force
LC_CTYPE to be utf8-capable.

This is mainly used in our tests, but might also be useful if people
are trying to do something weird.
@faho
Copy link
Member Author

faho commented Jun 6, 2021

Alright, I have now added a way to turn it off - setting $fish_allow_singlebyte_locale to 1 (strictly speaking something "bool_from_string" interprets as true).

This is mostly so we can actually test a C locale.

@faho faho merged commit 046db09 into fish-shell:master Jun 6, 2021
@faho faho deleted the fix-locale-harder branch July 29, 2021 15:57
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 1, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants