-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make \x the same as \X #9247
Make \x the same as \X #9247
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was actually just wondering about this when I committed 351500e, as I had completely forgotten we treated \x and \X differently until I reviewed that code. I agree with you on this, I think it's a needless distinction and actually a hinderance because most people would assume the escapes to be case-insensitive (though \u and \U still isn't and \c and \C behave differently, so...).
I'm not sure what backwards compatibility here would look like or how we would find regressions. The only cas I can image would be some sort of function that took unescaped values and then prefixed them with \x
before using them as literals, relying on fish to catch out-of-range values? That sounds arcane enough to either not exist or not be worth worrying about 🤷♀️.
@@ -107,8 +107,7 @@ Some characters cannot be written directly on the command line. For these charac | |||
- ``\r`` represents the carriage return character. | |||
- ``\t`` represents the tab character. | |||
- ``\v`` represents the vertical tab character. | |||
- ``\xHH``, where ``HH`` is a hexadecimal number, represents the ASCII character with the specified value. For example, ``\x9`` is the tab character. | |||
- ``\XHH``, where ``HH`` is a hexadecimal number, represents a byte of data with the specified value. If you are using a multibyte encoding, this can be used to enter invalid strings. Only use this if you know what you are doing. | |||
- ``\xHH`` or ``\XHH``, where ``HH`` is a hexadecimal number, represents a byte of data with the specified value. For example, ``\x9`` is the tab character. If you are using a multibyte encoding, this can be used to enter invalid strings. Typically fish is run with the ASCII or UTF-8 encoding, so anything up to ``\X7f`` is an ASCII character. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we're cleaning the docs up here, can we just change "If you are using a multibyte encoding, this can be used to enter invalid strings." to "Note that this can be used to create invalid strings." ?
The term "multibyte encoding" is rather ambiguous as it has been used to refer to "legacy" non-Unicode encodings, modern variable-byte encodings like UTF-8 and UTF-16, or encodings that always take more than one byte like UTF-32.
It also depends on what "invalid string" means, too. Is it one that isn't portable, one that doesn't display on your terminal, one that isn't mapped to a character in your current codepage, or one that includes control characters?
(Also, note that there's no good reason why we only warn about invalid strings for \x
and \X
but not for \o
, \u
, and \U
but I don't know what we should do about that.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Also, note that there's no good reason why we only warn about invalid strings for \x and \X but not for \o, \u, and \U but I don't know what we should do about that.)
All three of those are interpreted as codepoint numbers.
That means the error case is much smaller - if \u1234
is valid, then \u1235
will also be, except in the single case of \U10FFFF
.
In comparison, for \x
/\X
it depends much more immediately on the system's encoding, and for UTF-8 especially there are a lot of invalid byte values, even between valid ones.
So it is much easier to end up with nonsense, and that seems worth a note. We can add another note that \U10FFFF
is the highest valid unicode value, sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the case for Unicode values isn't as clear-cut. There are surrogate values that are valid Unicode codepoints but reserved for UTF-16 (0xD800 to 0xDFFF), so using them in a UTF-8 locale would result in an invalid string. I think there are other codepoints (not just byte values) that may be invalid depending on the particular encoding in use.
You're right though, I believe we already cap ingested octal escapes to saturate at 128, otherwise we'd have a problem with any single-byte octal-escaped value over 0o200 (all the way up to 0o777).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added in the maximum values.
Tbh I'd be entirely okay if we unified \u and \U as well.
I don't see the point in octal escapes at all, but removing those would be an actual backwards-compatibility concern. Put it on the "if I made fish from scratch" pile, I guess, along with \v
(has anyone ever used that? who even knows what a "vertical tab" is?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also \b
, I never used that intentionally but I often want to pass the literal backslash+b to rg
Something that's rarely stated outright: backwards compatibility only extends to already valid scripts. We can remove errors to our hearts' content. This is a tokenization error, which means if it happens fish will refuse to run the script entirely. |
Yes, but I guess I was wondering aloud whether someone is crazy enough to have based their own error checking on that. You could conceivably have a function that would previously error out when passed particular inputs that would result in out-of-range values, but now they would sail through. Note that I am not saying we should worry about it, but theoretically, someone could have a function like this: function print_hex
echo \x$argv[1]
end Which previously wouldn't accept an input like "C8" but now would. But like I said, this is being super pedantic and I think simplifying |
a23dc1e
to
6be3fc3
Compare
Up to now, in normal locales \x was essentially the same as \X, except that it errored if given a value > 0x7f. That's kind of annoying and useless. A subtle change is that `\xHH` now represents the character (if any) encoded by the byte value "HH", so even for values <= 0x7f if that's not the same as the ASCII value we would diverge. I do not believe anyone has ever run fish on a system where that distinction matters. It isn't a thing for UTF-8, it isn't a thing for ASCII, it isn't a thing for UTF-16, it isn't a thing for any extended ASCII scheme - ISO8859-X, it isn't a thing for SHIFT-JIS. I am reasonably certain we are making that same assumption in other places. Fixes fish-shell#1352
6be3fc3
to
aa001e7
Compare
They couldn't even do that, as this is a tokenizer error. That function will immediately error out in fish 3.5.1 and current master, both interactively and in a file. You don't even get a chance to call it. I have left a note in the CHANGELOG that this has changed, but I don't see any real issues with compatibility. |
Description
Up to now, in normal locales \x was essentially the same as \X, except that it errored if given a value > 0x7f.
That's kind of annoying and useless.
A subtle change is that
\xHH
now represents the character (if any)encoded by the byte value "HH", so even for values <= 0x7f if that's
not the same as the ASCII value we would diverge.
I do not believe anyone has ever run fish on a system where that
distinction matters. It isn't a thing for UTF-8, it isn't a thing for
ASCII, it isn't a thing for UTF-16, it isn't a thing for any extended
ASCII scheme - ISO8859-X, it isn't a thing for SHIFT-JIS.
I am reasonably certain we are making that same assumption in other
places.
Fixes #1352
TODOs:
This branch builds on #9245, because otherwise we would encode every
\xAB
with the ENCODE_DIRECT scheme, which would, among other things, break universal variables.