Make \x the same as \X #9247

Merged
faho merged 2 commits into fish-shell:master from no-more-ascii-check on Oct 9, 2022
Conversation

@faho faho (Member) commented Sep 29, 2022

Description

Up to now, in normal locales \x was essentially the same as \X, except that it errored if given a value > 0x7f.

That's kind of annoying and useless.

A subtle change is that \xHH now represents the character (if any)
encoded by the byte value "HH", so even for values <= 0x7f the result
could diverge from before on a system where that byte value is not the
same as the ASCII value.

I do not believe anyone has ever run fish on a system where that
distinction matters. It isn't a thing for UTF-8, it isn't a thing for
ASCII, it isn't a thing for UTF-16, it isn't a thing for any extended
ASCII scheme - ISO8859-X, it isn't a thing for SHIFT-JIS.

I am reasonably certain we are making that same assumption in other
places.
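
A quick sketch of the intended behavior, assuming a UTF-8 locale (the
exact glyph shown for the second line depends on the terminal and on how
the locale decodes the bytes):

    echo \x41        # byte 0x41 is ASCII "A", exactly as before
    echo \xc3\xa9    # bytes 0xC3 0xA9 decode to "é" under UTF-8;
                     # previously \xc3 was an error for being > 0x7f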

Fixes #1352

TODOs:

  • Changes to fish usage are reflected in user documentation/manpages.
  • Tests have been added for regressions fixed
  • User-visible changes noted in CHANGELOG.rst

This branch builds on #9245, because otherwise we would encode every \xAB with the ENCODE_DIRECT scheme, which would, among other things, break universal variables.

@faho faho added this to the fish-future milestone Sep 29, 2022
@mqudsi mqudsi (Contributor) left a comment


I was actually just wondering about this when I committed 351500e, as I had completely forgotten we treated \x and \X differently until I reviewed that code. I agree with you on this, I think it's a needless distinction and actually a hindrance because most people would assume the escapes to be case-insensitive (though \u and \U still aren't, and \c and \C behave differently, so...).

I'm not sure what backwards compatibility here would look like or how we would find regressions. The only case I can imagine would be some sort of function that took unescaped values and then prefixed them with \x before using them as literals, relying on fish to catch out-of-range values? That sounds arcane enough to either not exist or not be worth worrying about 🤷‍♀️.

@@ -107,8 +107,7 @@ Some characters cannot be written directly on the command line. For these charac
 - ``\r`` represents the carriage return character.
 - ``\t`` represents the tab character.
 - ``\v`` represents the vertical tab character.
-- ``\xHH``, where ``HH`` is a hexadecimal number, represents the ASCII character with the specified value. For example, ``\x9`` is the tab character.
-- ``\XHH``, where ``HH`` is a hexadecimal number, represents a byte of data with the specified value. If you are using a multibyte encoding, this can be used to enter invalid strings. Only use this if you know what you are doing.
+- ``\xHH`` or ``\XHH``, where ``HH`` is a hexadecimal number, represents a byte of data with the specified value. For example, ``\x9`` is the tab character. If you are using a multibyte encoding, this can be used to enter invalid strings. Typically fish is run with the ASCII or UTF-8 encoding, so anything up to ``\X7f`` is an ASCII character.
@mqudsi mqudsi (Contributor) commented Sep 29, 2022

While we're cleaning the docs up here, can we just change "If you are using a multibyte encoding, this can be used to enter invalid strings." to "Note that this can be used to create invalid strings." ?

The term "multibyte encoding" is rather ambiguous as it has been used to refer to "legacy" non-Unicode encodings, modern variable-byte encodings like UTF-8 and UTF-16, or encodings that always take more than one byte like UTF-32.

It also depends on what "invalid string" means. Is it one that isn't portable, one that doesn't display on your terminal, one that isn't mapped to a character in your current codepage, or one that includes control characters?

(Also, note that there's no good reason why we only warn about invalid strings for \x and \X but not for \o, \u, and \U but I don't know what we should do about that.)

@faho faho (Member, Author) commented Sep 30, 2022

> (Also, note that there's no good reason why we only warn about invalid strings for \x and \X but not for \o, \u, and \U but I don't know what we should do about that.)

All three of those are interpreted as codepoint numbers.

That means the error case is much smaller - if \u1234 is valid, then \u1235 will also be, except in the single case of \U10FFFF.

In comparison, for \x/\X it depends much more immediately on the system's encoding, and for UTF-8 especially there are a lot of invalid byte values, even between valid ones.

So it is much easier to end up with nonsense, and that seems worth a note. We can add another note that \U10FFFF is the highest valid Unicode value, sure.
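
A rough sketch of that difference, assuming a UTF-8 locale (what the
second line actually produces depends entirely on the encoding in use):

    echo \u00e9    # codepoint U+00E9 ("é"), valid regardless of how the locale encodes it
    echo \xe9      # byte 0xE9: "é" under ISO-8859-1, but on its own an
                   # incomplete, invalid sequence under UTF-8, i.e. exactly the nonsense case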

@mqudsi mqudsi (Contributor)

Note that the case for Unicode values isn't as clear-cut. There are surrogate values that are valid Unicode codepoints but reserved for UTF-16 (0xD800 to 0xDFFF), so using them in a UTF-8 locale would result in an invalid string. I think there are other codepoints (not just byte values) that may be invalid depending on the particular encoding in use.

You're right though, I believe we already cap ingested octal escapes to saturate at 128, otherwise we'd have a problem with any single-byte octal-escaped value over 0o200 (all the way up to 0o777).

@faho faho (Member, Author)

I added in the maximum values.

Tbh I'd be entirely okay if we unified \u and \U as well.

I don't see the point in octal escapes at all, but removing those would be an actual backwards-compatibility concern. Put it on the "if I made fish from scratch" pile, I guess, along with \v (has anyone ever used that? who even knows what a "vertical tab" is?)

Member

also \b, I never used that intentionally but I often want to pass the literal backslash+b to rg

@faho faho (Member, Author) commented Sep 30, 2022

> I'm not sure what backwards compatibility here would look like or how we would find regressions.

Something that's rarely stated outright: backwards compatibility only extends to already valid scripts. We can remove errors to our hearts' content.

This is a tokenization error, which means if it happens fish will refuse to run the script entirely.

@mqudsi mqudsi (Contributor) commented Sep 30, 2022

> This is a tokenization error, which means if it happens fish will refuse to run the script entirely.

Yes, but I guess I was wondering aloud whether someone is crazy enough to have based their own error checking on that. You could conceivably have a function that previously errored out when passed particular inputs that resulted in out-of-range values, but that would now sail through. Note that I am not saying we should worry about it, but theoretically, someone could have a function like this:

function print_hex
    echo \x$argv[1]
end

This previously wouldn't accept an input like "C8" but now would. But like I said, this is being super pedantic and I think simplifying \x vs \X is well worth breaking oddball scripts like this, which are sure to be in the extreme minority.

@faho faho modified the milestones: fish-future, fish 3.6.0 Oct 9, 2022
@faho faho (Member, Author) commented Oct 9, 2022

> Note that I am not saying we should worry about it, but theoretically, someone could have a function like this:

They couldn't even do that, as this is a tokenizer error. That function will immediately error out in fish 3.5.1 and current master, both interactively and in a file. You don't even get a chance to call it.

I have left a note in the CHANGELOG that this has changed, but I don't see any real issues with compatibility.

@faho faho merged commit b9b0bc7 into fish-shell:master Oct 9, 2022
@faho faho deleted the no-more-ascii-check branch October 26, 2022 10:21
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 30, 2023
Development

Successfully merging this pull request may close these issues.

Consider extending \x to support non-ASCII characters
3 participants