Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Variables saved with C locale are broken in other locales (was "iTerm2: Wrong $PWD after locale is changed") #2613

Closed
ghost opened this issue Dec 16, 2015 · 32 comments · Fixed by #9473
Assignees
Labels
bug Something that's not working as intended
Milestone

Comments

@ghost
Copy link

ghost commented Dec 16, 2015

Already discussed in another issue: https://gitlab.com/gnachman/iterm2/issues/4083
But it seems more like a problem from fish.

Steps to reproduce:

  1. iTerm2 Preferences > Profiles > General > Working Directory: Reuse previous session's directory
  2. In config.fish, export LANG="en_US.UTF-8"
  3. cd /path/to/Äpfel; and echo $PWD outputs /path/to/Äpfel
  4. Switch to a new tab by pressing command-T
  5. In the new tab, echo $PWD, gets /path/to/A�pfel
  6. pwd outputs: /path/to/Äpfel (normal)

I suspect this is related to Unicode Normalization Form Decomposition (NFD). And here are some JavaScript for comparison:

$ encodeURI('Äpfel'.normalize('NFD'))
"A%CC%88pfel"
$ decodeURI(encodeURI('Äpfel'.normalize('NFD')))
"Äpfel"
$ unescape(encodeURI('Äpfel'.normalize('NFD')))
"A�pfel"

and two fish commands to illustrate the mechanism of unescape above:

$ printf '\xcc\x88'
 ̈
$ printf '\u00cc\u0088'
�
@gnachman
Copy link

I can see from the linked issue that the value of PWD at the time login is run is /Users/mmmmmm/Äpfel (hex value 5057443d2f55736572732f6d6d6d6d6d6d2f41cc887066656c)

@geoff-nixon
Copy link
Contributor

I cannot reproduce this issue. Are you using the iTerm2 nightlies, and a fish built from a recent master?

@geoff-nixon
Copy link
Contributor

But assuming this is indeed some edge case and valid, my assumption would be that this is neither an iTerm2 issue or a fish issue, but an issue with HFS+ (which, as you mention) is always decomposed via NFD in theory.

Do you experience this issue using /usr/bin/printf, i.e., command printf? (converting all your escape sequences to octal, which is the only escape recognized by POSIX for a printf function, the rest being extensions)?

@geoff-nixon
Copy link
Contributor

@ghost
Copy link
Author

ghost commented Dec 29, 2015

@geoff-codes iTerm2 (Build 2.1.4) and nightlies; fish 2.2.0 and master.

Do you experience this issue using /usr/bin/printf, i.e., command printf? (converting all your escape sequences to octal, which is the only escape recognized by POSIX for a printf function, the rest being extensions)?

?? (The printf in my example above is just for illustration.)

Ha: just found this: https://www.bountysource.com/issues/987904-unicode-normalization-issues-with-hfs

Thanks. (It is issue #474.) But I think it is just another issue about NFD, and that is not a bug in fish.

Please see this example:

# Type Ä (NFC) with the keyboard.
$ printf 'Ä' | od -A x -t x1z -v
000000 c3 84                                            >�.<
000002

$ touch 'Ä'

# Ä (NFD) is from file/directory name auto-completion. Don't copy&paste.
$ printf 'Ä' | od -A x -t x1z -v
000000 41 cc 88                                         >A�.<
000003

So OS X is actually using the NFD form for filenames. However, fish gets the correct byte sequence 41cc88 from iTerm2 (see @gnachman 's comment), but sets the wrong bytes in $PWD (neither NFC or NFD form of Ä).

It will be great if anyone can tell what is the exact source code of fish to initialize the environment variables. I guess the problem occurs during the conversion between wchar_t * and char *.

@geoff-nixon
Copy link
Contributor

Do you experience this issue using /usr/bin/printf, i.e., command printf? (converting all your escape sequences to octal, which is the only escape recognized by POSIX for a printf function, the rest being extensions)?

?? (The printf in my example above is just for illustration.)

It is somewhat relevant. fish's builtin printf, GNU coreutils printf, and the system /usr/bin/printf all have a different set of features, etc. I can tell by your od invocation that you have GNU coreutils linked (and not g-prefixed) in your path before the system utilities, which can cause a lot of trouble; I'd advise you consider installing coreutils like homebrew does; configure with --with-program-prefix=g, so GNU echo and GNU printf become 'gecho' and 'gprintf'.

But in any event, I simply cannot reproduce your issue.

unicodelookgoodtome

tab

- What is your `__CF_USER_TEXT_ENCODING`? - Do you have HFS+ normalization on or off in in iTerm? - What is your Font and "Non-ASCII Font" set to? - Do you have any pre-exec scripts or functions installed that might be messing with $PWD?

@ghost
Copy link
Author

ghost commented Dec 30, 2015

@geoff-codes Sorry, I didn't mentioned the gnu od. The builtin printf overrides my gnu printf, so I still don't know what you mean. Besides, I build fish with PATH="/usr/bin:/usr/sbin:/bin:/sbin"

  • What is your __CF_USER_TEXT_ENCODING? Undefined
  • Do you have HFS+ normalization on or off in in iTerm? OFF
  • What is your Font and "Non-ASCII Font" set to? Only ASCII: Menlo
  • Do you have any pre-exec scripts or functions installed that might be messing with $PWD? No. PWD is read only.

After testing for a while, I found that my problem occurs right after the locale is changed from undefined (default: C?) to en_US.UTF-8. I also configure iTerm2 to use ~/Äpfel as the default directory for new sessions. The title of this issue should be changed:

config.fish

locale
echo (basename $PWD)
set -gx LANG en_US.UTF-8  #OR: set -gx LC_ALL en_US.UTF-8
echo (basename $PWD)
locale

output:

LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
Äpfel
A�pfel
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

Luckily, problem solved with this setting in iTerm2 (still needs some investigation, @gnachman ):

iTerm2 Preferences > Profiles > Terminal > Environment > Set locale variables automatically

locale
echo (basename $PWD)
set -gx LANG en_US.UTF-8  #OR: set -gx LC_ALL en_US.UTF-8
echo (basename $PWD)
locale

output:

LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
Äpfel
Äpfel
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

@ghost ghost changed the title iTerm2: Wrong $PWD from previous session with a UTF-8 locale iTerm2: Wrong $PWD after locale is changed Dec 30, 2015
@krader1961 krader1961 self-assigned this Mar 25, 2016
@krader1961 krader1961 added the bug Something that's not working as intended label Mar 25, 2016
@krader1961
Copy link
Contributor

@jakwings: Does this still occur with a fish built from git head? I can't reproduce using code that includes recent changes to the __fish_urlencode function plus my changes to how the C locale is supported.

@ghost
Copy link
Author

ghost commented Mar 25, 2016

Yes.

Like my last comment above it seems like a bug about how fish deals with the change of locale. It is basically like:

# first in the C locale
cd ~/A\Xcc\X88pfel
# ouch
set -lx LANG en_US.UTF-8

Luckily iTerm2 has an option to start fish with an automatically set locale (LC_CTYPE=UTF-8 or something else dependent on your OS's language). So I can avoid the change of locale.

@krader1961
Copy link
Contributor

Okay, but now you're describing a different issue than you originally reported. The issue you're now reporting is because some cached strings aren't re-encoded when the locale changes. I explicitly did not "fix" that behavior and said so in my issue regarding implementing better support for the C locale. The original issue you reported seems to be fixed AFAICT so I'm going to close this.

If someone feels strongly about supporting changing between UTF-8 and C locales when the PWD is a non-ASCII path feel free to open an issue. However, I would expect that issue to illustrate a real-world example of how the incomplete support for changing locales in that manner is a problem, not just a hypothetical example.

@ghost
Copy link
Author

ghost commented Mar 25, 2016

Okay, but now you're describing a different issue than you originally reported.

No, I believe they are the same thing, because config.fish is loaded after fish is started with the C locale. And you did not ever "fix" it.

If someone feels strongly about supporting changing between UTF-8 and C locales when the PWD is a non-ASCII path feel free to open an issue.

Anyway, this can be a reason to close this issue if fish doesn't care much about it.

@ghost
Copy link
Author

ghost commented Mar 25, 2016

However, I would expect that issue to illustrate a real-world example of how the incomplete support for changing locales in that manner is a problem, not just a hypothetical example.

Hypothetical? So I'm not a real human with real life, sorry. I don't know why I'll notice this line of text. Feel like I'm still the first time to meet you.

@faho
Copy link
Member

faho commented Mar 25, 2016

Hypothetical? So I'm not a real human with real life, sorry. I don't know why I'll notice this line of text. Feel like I'm still the first time to meet you.

I think what @krader1961 was trying to say is that setting locale in a running fish doesn't seem like a hugely useful thing - you can always start another fish.

Though if it's actually still an issue if you're setting locale in config.fish, that's a bigger problem. I think I tested this a while ago and it worked for me. Hold on and let me test.

@krader1961
Copy link
Contributor

was trying to say is that setting locale in a running fish doesn't seem like a hugely useful thing

Right. In particular, starting in the C locale in a situation where non-ASCII data is present in the PWD and expecting sane behavior by switching to a UTF-8 locale.

@krader1961
Copy link
Contributor

If that scenario is something that is biting people in the ass then, yes, we should try to handle it better. But I still contend that the original problem statement has been fixed. What you're describing now is closely related but is not the same problem.

@floam
Copy link
Member

floam commented Mar 25, 2016

I imagine anyone who hits this is changing the locale in response to the non-ASCII data in the PWD/environment, with the expectation that it should fix things up or that it's the kosher thing to do, to get off the C locale once they have wide characters happening, to have the locale agree with the character set. They aren't just doing random crazy shit to try to break it. That's a reasonable thing to want fixed. If I understand his transcript it looks like fixing the locale has a NOT AWESOME seemingly punishing, paradoxical effect on his PWD.

That sounds like a reasonable issue to want to see solved. I don't know what the "original problem statement" is supposed to be, but what I just described seems to fit his examples and descriptions.

Is there currently a bug with the PWD getting messed up if that situation occurs? If no, I don't understand how it is helpful to close this.

@floam
Copy link
Member

floam commented Mar 25, 2016

screenshot 2016-03-25 at 3 29 17 pm

@floam
Copy link
Member

floam commented Mar 25, 2016

@jakwings: Does this still occur with a fish built from git head? I can't reproduce using code that includes recent changes to the __fish_urlencode function plus my changes to how the C locale is supported.

Oops. I am a little out of date. I'll try with HEAD.

@faho
Copy link
Member

faho commented Mar 25, 2016

Humm.... there's something weird here...

Unfortunately I can't test the actual iTerm2 stuff (no OSX machine), but it seems the issue might not lie with __fish_urlencode, but __update_vte_cwd. __fish_urlencode prints the same thing regardless of locale (unsurprising, it sets LC_ALL=C), but __update_vte_cwd doesn't, so it might confuse the terminal.

Can anyone of you try how it behaves if you put set -lx LC_ALL C into __update_vte_cwd? (This will have to be done in __fish_config_interactive.fish itself)

@krader1961
Copy link
Contributor

@floam: Your example is the same as the most recent example by @jakwings. We know that doesn't work. The original problem description, and the one it referred to say nothing about starting fish in the C locale. So I assumed this was the same issue reported elsewhere (don't have the number handy) that this didn't work at all for non-ASCII chars.

Now, if it is in fact the case that the original problem description inadvertently omitted that the C locale is in effect when fish is started and need be able to switch to a UTF-8 locale on the fly (or in ~/.config/fish/config.fish which is the same thing) then this should be reopened. But I'd be really curious why the default locale isn't a UTF-8 variant if you know your system contains unicode data?

@krader1961 krader1961 reopened this Mar 25, 2016
@floam
Copy link
Member

floam commented Mar 25, 2016

We know that doesn't work.

We do? HEAD acts different in my case. It's more difficult to use on account of the wide character errors, but setting the correct locale makes the PWD render correctly.

@krader1961 krader1961 removed their assignment Mar 25, 2016
@krader1961
Copy link
Contributor

@faho: The problem that @jakwings is now (originally?) reporting is that if PWD already contains a non-ASCII, UTF-8, string then switching to a UTF-8 locale without doing another cd does not re-encode PWD. That causes __update_vte_cwd to emit a non-sensical URI. You can see what's happening through a simpler example:

$ env LC_ALL=C fish
$ cd /tmp/Ä̈̈pfel/
$ echo $PWD|xxd
00000000: 2f70 7269 7661 7465 2f74 6d70 2f41 cc88  /private/tmp/A..
00000010: 7066 656c 0a                             pfel.
$ set -x LC_ALL en_US.UTF-8
$ echo $PWD|xxd
00000000: 2f70 7269 7661 7465 2f74 6d70 2f41 c38c  /private/tmp/A..
00000010: c288 7066 656c 0a                        ..pfel.

Note that simply doing cd . after the locale change "fixes" the problem by forcing PWD to be updated.

$ cd .
$ echo $PWD|xxd
00000000: 2f70 7269 7661 7465 2f74 6d70 2f41 cc88  /private/tmp/A..
00000010: 7066 656c 0a                             pfel.

@krader1961
Copy link
Contributor

Please note that while this discussion has been focused on PWD the problem is more general than that. We could special-case a solution just for PWD (by using a function to monitor changes to the locale vars and forcing a cd .) but that wouldn't solve the general problem.

@floam
Copy link
Member

floam commented Mar 25, 2016

Here's trying to do roughly the same thing on HEAD. It's awful because those errors mess with my cursor position and spawn themselves just as I type, go on for pages.

But you can see that at the end the result is different. Changing the locale to UTF8 makes the string render.

screenshot 2016-03-25 at 3 52 45 pm
screenshot 2016-03-25 at 3 52 22 pm

@floam
Copy link
Member

floam commented Mar 25, 2016

I'm changing the wide character errors to be a little bit less Shock and Awe in presentation. The blitzkrieg printing of that error, one line for each and any character, while the user is interacting is an obvious thing to tone down and throttle/filter as it's usually a worse thing to experience than the than the actual bug.

@ghost
Copy link
Author

ghost commented Mar 26, 2016

@krader1961 You are really good at playing with words.

The original problem description, and the one it referred to say nothing about starting fish in the C locale.

So as a user, can I tell you what exactly happened and let you understand my problem? It is so ridiculous for you to quit discussion just for an obscure description.

… then switching to a UTF-8 locale without doing another cd does not re-encode PWD. That causes __update_vte_cwd to emit a non-sensical URI. You can see what's happening through a simpler example

So cd (pwd) is the best way to solve this issue? Holy shit. I hope no one will ever play with that LC_ALL=C trick and stumble over it in the future. This is a nice feature of fish.

Can you provide an example to tell me __update_vte_cmd did run?

Please note that while this discussion has been focused on PWD the problem is more general than that. We could special-case a solution just for PWD (by using a function to monitor changes to the locale vars and forcing a cd .) but that wouldn't solve the general problem.

I wish you good luck.

@ghost ghost closed this as completed Mar 26, 2016
@ghost ghost reopened this Mar 26, 2016
@geoff-nixon
Copy link
Contributor

I'm very confused as to what the issue here even is anymore. One should not be setting a C locale for an interactive shell on OS X. Mac OS X precomposes UTF-8 at the device level, and HFS+ saves filenames as decomposed UTF-8. If I set a C locale manually in either/both iTerm and Terminal.app, using either fish or bash, weird stuff starts happening.

Why is this unexpected? How is this a fish issue? Is there some reason one would need to use fish (the friendly interactive shell) on Mac OS X in a C locale?

@ghost
Copy link
Author

ghost commented Mar 26, 2016

@geoff-codes

If I set a C locale manually in either/both iTerm and Terminal.app, using either fish or bash, weird stuff starts happening.

Welcome to open a new issue to describe the details. Besides, fish is not bash.

Why is this unexpected? How is this a fish issue? Is there some reason one would need to use fish (the friendly interactive shell) on Mac OS X in a C locale?

I don't know. And fish doesn't document it. Any proposal for the behavior of fish under a c locale on OS X? New issue welcomed.

@floam
Copy link
Member

floam commented Mar 26, 2016

If I set a C locale manually in either/both iTerm and Terminal.app, using either fish or bash, weird stuff starts happening.

It works here just fine for me in zsh and bash. You do want to make sure you pick the encoding in Terminal.app, check the box for it to set the locale env vars for you, also check that it is set to escape non-ASCII characters with ^V.

And by works I mean: I'm limited to like 128 characters, it's really boring. And of course my file with chiense characters in it looks like jibberish, like they did in 2002 on my slackware box before I started recompiling stuff with bleeding edge unicode support where I could.

What it doesn't do is cause errors and if I switch to a different encoding things just work (in zsh/bash).

@floam floam self-assigned this Sep 6, 2016
@floam floam added this to the fish-future milestone Sep 6, 2016
@mqudsi
Copy link
Contributor

mqudsi commented Sep 21, 2018

I'm inclined to agree with @geoff-codes here.

My preference here would be to lock down LANG, LC_ALL, etc. and emit an error message when a user tries to set them to the effect of "the locale cannot be changed after starting fish, if you wish to use fish with a different locale, start a new fish session with env LANG=... fish."?

fish assumes some flavor of unicode encoding throughout. If on a system that doesn't have unicode support, well, that's why utf-8 is backwards-compatible with the ansi/ascii character set, and a lot of unicode-based functionality won't work. But there's no reason to start fish in one locale, then change the locale for fish itself in the middle of a session.

@zanchey
Copy link
Member

zanchey commented Oct 31, 2018

I do it for testing purposes; it also makes it difficult to start another program under a certain locale without jumping through the env hoop.

@faho faho changed the title iTerm2: Wrong $PWD after locale is changed Variables saved with C locale are broken in other locales (was "iTerm2: Wrong $PWD after locale is changed") Mar 18, 2019
@faho
Copy link
Member

faho commented Mar 18, 2019

Okay, let's restart this one.

First of all, there's no need for iTerm or even $PWD to reproduce any of this. Simply do

$ set -gx LC_ALL C
$ set -l var …
$ set -gx LC_ALL en_US.UTF-8
$ echo $var
â¦

This seems to be caused by us converting the value to a "wide" string while in C locale by doing... well, nothing, really (

fish-shell/src/common.cpp

Lines 285 to 292 in 03454b7

if (MB_CUR_MAX == 1) {
// Single-byte locale, all values are legal.
while (in_pos < in_len) {
result.push_back((unsigned char)in[in_pos]);
in_pos++;
}
return result;
}
):

    if (MB_CUR_MAX == 1) {
        // Single-byte locale, all values are legal.
        while (in_pos < in_len) {
            result.push_back((unsigned char)in[in_pos]);
            in_pos++;
        }
        return result;
    }

we simply copy the byte values and call it a day, storing it in memory. Then the locale changes, and we read it as if it were saved in a multibyte locale.

In my tests, simply removing that code makes it work (well, there's weirdness in that output.cpp duplicates wcs2str and complains about these. Changing it to use wcs2str, like it probably should anyway, removes the errors).

@ridiculousfish: Since you're more knowledgeable about all of this, does that sound sensible to you? Is there any reason why str2wcs in a singlebyte locale wouldn't just work? If mbrtowc wouldn't work, I'd assume it would return an error, at which point we fall back to encoding direct, which should be locale-independent?

dlukes added a commit to dlukes/dotfiles that referenced this issue Feb 22, 2020
Otherwise issues can manifest with non-ASCII characters, e.g.
disappearing prompt in a local fish -> SSH -> remote fish scenario. Cf.
fish-shell/fish-shell#2613.
dlukes added a commit to dlukes/go-fish that referenced this issue Feb 22, 2020
Otherwise issues can manifest with non-ASCII characters, e.g.
disappearing prompt in a local fish -> SSH -> remote fish scenario. Cf.
fish-shell/fish-shell#2613.
dlukes added a commit to dlukes/dotfiles that referenced this issue Feb 22, 2020
Otherwise issues can manifest with non-ASCII characters, e.g.
disappearing prompt in a local fish -> SSH -> remote fish scenario. Cf.
fish-shell/fish-shell#2613.
faho added a commit to faho/fish-shell that referenced this issue Jan 14, 2023
This meant we didn't actually do our weird en/decoding scheme for e.g.
a C locale, which meant that, when you then switch to a proper locale
the previous variables were broken.

I don't know how to test this automatically - none of my attempts seem
to ever *fail* with the old code, here's what you'd do manually:

- Run fish with an actual C locale (LC_ALL=C
fish_allow_singlebyte_locale=1 fish)
- `set -gx foo 💩`
- `set -e LC_ALL`
- `echo $foo` outputs "💩" if it works and "ð⏎" if it's broken.

Fixes fish-shell#2613
faho added a commit that referenced this issue Jan 14, 2023
This meant we didn't actually do our weird en/decoding scheme for e.g.
a C locale, which meant that, when you then switch to a proper locale
the previous variables were broken.

I don't know how to test this automatically - none of my attempts seem
to ever *fail* with the old code, here's what you'd do manually:

- Run fish with an actual C locale (LC_ALL=C
fish_allow_singlebyte_locale=1 fish)
- `set -gx foo 💩`
- `set -e LC_ALL`
- `echo $foo` outputs "💩" if it works and "ð⏎" if it's broken.

Fixes #2613
@faho faho modified the milestones: fish-future, fish 3.6.1 Jan 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that's not working as intended
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants