Make \x the same as \X #9247

Merged
faho merged 2 commits into fish-shell:master from no-more-ascii-check on Oct 9, 2022
Conversation

@faho faho (Member) commented Sep 29, 2022

Description

Up to now, in normal locales \x was essentially the same as \X, except that it errored if given a value > 0x7f.

That's kind of annoying and useless.

A subtle change is that \xHH now represents the character (if any)
encoded by the byte value "HH", so even for values <= 0x7f the result
could diverge from before on a system where that byte value is not the
same as the ASCII value.

I do not believe anyone has ever run fish on a system where that
distinction matters. It isn't a thing for UTF-8, it isn't a thing for
ASCII, it isn't a thing for UTF-16, it isn't a thing for any extended
ASCII scheme - ISO8859-X, it isn't a thing for SHIFT-JIS.

I am reasonably certain we are making that same assumption in other
places.
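
A quick sketch of the intended behavior, assuming a UTF-8 locale (the
exact glyph shown for the second line depends on the terminal and on how
the locale decodes the bytes):

    echo \x41        # byte 0x41 is ASCII "A", exactly as before
    echo \xc3\xa9    # bytes 0xC3 0xA9 decode to "é" under UTF-8;
                     # previously \xc3 was an error for being > 0x7f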

Fixes #1352

TODOs:

  • Changes to fish usage are reflected in user documentation/manpages.
  • Tests have been added for regressions fixed
  • User-visible changes noted in CHANGELOG.rst

This branch builds on #9245, because otherwise we would encode every \xAB with the ENCODE_DIRECT scheme, which would, among other things, break universal variables.

@faho faho added this to the fish-future milestone Sep 29, 2022
@mqudsi mqudsi (Contributor) left a comment


I was actually just wondering about this when I committed 351500e, as I had completely forgotten we treated \x and \X differently until I reviewed that code. I agree with you on this, I think it's a needless distinction and actually a hindrance because most people would assume the escapes to be case-insensitive (though \u and \U still aren't, and \c and \C behave differently, so...).

I'm not sure what backwards compatibility here would look like or how we would find regressions. The only case I can imagine would be some sort of function that took unescaped values and then prefixed them with \x before using them as literals, relying on fish to catch out-of-range values? That sounds arcane enough to either not exist or not be worth worrying about 🤷‍♀️.

@@ -107,8 +107,7 @@ Some characters cannot be written directly on the command line. For these charac
 - ``\r`` represents the carriage return character.
 - ``\t`` represents the tab character.
 - ``\v`` represents the vertical tab character.
-- ``\xHH``, where ``HH`` is a hexadecimal number, represents the ASCII character with the specified value. For example, ``\x9`` is the tab character.
-- ``\XHH``, where ``HH`` is a hexadecimal number, represents a byte of data with the specified value. If you are using a multibyte encoding, this can be used to enter invalid strings. Only use this if you know what you are doing.
+- ``\xHH`` or ``\XHH``, where ``HH`` is a hexadecimal number, represents a byte of data with the specified value. For example, ``\x9`` is the tab character. If you are using a multibyte encoding, this can be used to enter invalid strings. Typically fish is run with the ASCII or UTF-8 encoding, so anything up to ``\X7f`` is an ASCII character.
@mqudsi mqudsi (Contributor) commented Sep 29, 2022

While we're cleaning the docs up here, can we just change "If you are using a multibyte encoding, this can be used to enter invalid strings." to "Note that this can be used to create invalid strings." ?

The term "multibyte encoding" is rather ambiguous as it has been used to refer to "legacy" non-Unicode encodings, modern variable-byte encodings like UTF-8 and UTF-16, or encodings that always take more than one byte like UTF-32.

It also depends on what "invalid string" means. Is it one that isn't portable, one that doesn't display on your terminal, one that isn't mapped to a character in your current codepage, or one that includes control characters?

(Also, note that there's no good reason why we only warn about invalid strings for \x and \X but not for \o, \u, and \U but I don't know what we should do about that.)

@faho faho (Member, Author) commented Sep 30, 2022

> (Also, note that there's no good reason why we only warn about invalid strings for \x and \X but not for \o, \u, and \U but I don't know what we should do about that.)

All three of those are interpreted as codepoint numbers.

That means the error case is much smaller - if \u1234 is valid, then \u1235 will also be, except in the single case of \U10FFFF.

In comparison, for \x/\X it depends much more immediately on the system's encoding, and for UTF-8 especially there are a lot of invalid byte values, even between valid ones.

So it is much easier to end up with nonsense, and that seems worth a note. We can add another note that \U10FFFF is the highest valid Unicode value, sure.
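
A rough sketch of that difference, assuming a UTF-8 locale (what the
second line actually produces depends entirely on the encoding in use):

    echo \u00e9    # codepoint U+00E9 ("é"), valid regardless of how the locale encodes it
    echo \xe9      # byte 0xE9: "é" under ISO-8859-1, but on its own an
                   # incomplete, invalid sequence under UTF-8, i.e. exactly the nonsense case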

@mqudsi mqudsi (Contributor)

Note that the case for Unicode values isn't as clear-cut. There are surrogate values that are valid Unicode codepoints but reserved for UTF-16 (0xD800 to 0xDFFF), so using them in a UTF-8 locale would result in an invalid string. I think there are other codepoints (not just byte values) that may be invalid depending on the particular encoding in use.

You're right though, I believe we already cap ingested octal escapes to saturate at 128, otherwise we'd have a problem with any single-byte octal-escaped value over 0o200 (all the way up to 0o777).

@faho faho (Member, Author)

I added in the maximum values.

Tbh I'd be entirely okay if we unified \u and \U as well.

I don't see the point in octal escapes at all, but removing those would be an actual backwards-compatibility concern. Put it on the "if I made fish from scratch" pile, I guess, along with \v (has anyone ever used that? who even knows what a "vertical tab" is?)

Member

also \b, I never used that intentionally but I often want to pass the literal backslash+b to rg

@faho faho (Member, Author) commented Sep 30, 2022

> I'm not sure what backwards compatibility here would look like or how we would find regressions.

Something that's rarely stated outright: backwards compatibility only extends to already valid scripts. We can remove errors to our hearts' content.

This is a tokenization error, which means if it happens fish will refuse to run the script entirely.

@mqudsi mqudsi (Contributor) commented Sep 30, 2022

> This is a tokenization error, which means if it happens fish will refuse to run the script entirely.

Yes, but I guess I was wondering aloud whether someone is crazy enough to have based their own error checking on that. You could conceivably have a function that previously errored out when passed particular inputs that resulted in out-of-range values, but that would now sail through. Note that I am not saying we should worry about it, but theoretically, someone could have a function like this:

function print_hex
    echo \x$argv[1]
end

This previously wouldn't accept an input like "C8" but now would. But like I said, this is being super pedantic and I think simplifying \x vs \X is well worth breaking oddball scripts like this, which are sure to be in the extreme minority.

@faho faho modified the milestones: fish-future, fish 3.6.0 Oct 9, 2022
@faho faho (Member, Author) commented Oct 9, 2022

> Note that I am not saying we should worry about it, but theoretically, someone could have a function like this:

They couldn't even do that, as this is a tokenizer error. That function will immediately error out in fish 3.5.1 and current master, both interactively and in a file. You don't even get a chance to call it.

I have left a note in the CHANGELOG that this has changed, but I don't see any real issues with compatibility.

@faho faho merged commit b9b0bc7 into fish-shell:master Oct 9, 2022
@faho faho deleted the no-more-ascii-check branch October 26, 2022 10:21
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 30, 2023
Development

Successfully merging this pull request may close these issues.

Consider extending \x to support non-ASCII characters
3 participants