Fix bugzilla issue 24841 - UTF-16 surrogates when used as an escape of a string should hint on error #17047
Thanks for your pull request and interest in making D better, @rikkimax! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please see CONTRIBUTING.md for more information. If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.
Testing this PR locally
If you don't have a local development environment set up, you can use Digger to test this PR: dub run digger -- build "stable + dmd#17047"
The reason this came up is that JavaScript escapes use UTF-16 code units. This affects the translation of JSON files over to D. D's escapes use code points rather than code units, which is far saner.
@@ -1556,6 +1556,8 @@ class Lexer
 if (ndigits != 2 && !utf_isValidDchar(v))
 {
     error(loc, "invalid UTF character \\U%08x", v);
+    if (v >= 0xD800 && v <= 0xDFFF)
+        errorSupplemental("The code unit is a UTF-16 surrogate, is the escape UTF-16 not a Unicode code point?");
Are we able to determine with certainty that it is not a code point?
Yes, if this branch is taken.
UTF-16 surrogates are never valid code points. They are an encoding detail.
I wrote it as a question, because for all we know they could be trying to \x it, rather than trying to encode wchars.
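To illustrate the point about surrogates being an encoding detail, here is a minimal sketch (not code from this PR) of the same range check plus the usual UTF-16 pair-to-code-point arithmetic. The names `isSurrogate` and `combineSurrogates` are hypothetical helpers invented for this example; they are not dmd or Phobos functions.

```d
import std.stdio : writefln;

/// True if v lies in the UTF-16 surrogate range U+D800..U+DFFF.
/// Hypothetical helper; mirrors the range test added in the diff.
bool isSurrogate(uint v) pure nothrow @nogc @safe
{
    return v >= 0xD800 && v <= 0xDFFF;
}

/// Combine a high (lead) and low (trail) surrogate into one code point.
/// Hypothetical helper; assumes the caller passes a well-formed pair.
dchar combineSurrogates(uint hi, uint lo) pure nothrow @nogc @safe
{
    assert(hi >= 0xD800 && hi <= 0xDBFF, "expected a high surrogate");
    assert(lo >= 0xDC00 && lo <= 0xDFFF, "expected a low surrogate");
    return cast(dchar)(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
}

void main()
{
    // JSON/JavaScript spell U+1F600 as the surrogate pair \uD83D \uDE00.
    assert(isSurrogate(0xD83D) && isSurrogate(0xDE00));
    immutable dchar cp = combineSurrogates(0xD83D, 0xDE00);
    assert(cp == 0x1F600);

    // The D escape names the code point directly instead.
    assert("\U0001F600"d[0] == cp);
    writefln("\\U%08X", cast(uint) cp);
}
```

Someone translating JSON to D by hand would replace each surrogate pair with the single `\U` escape computed this way.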
Can we differentiate between \x and \U escapes?
Yes, but we don't need to.
\x takes only two hex digits, so it can't represent the surrogates, which need four.
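The width argument can be checked with a few compile-time assertions. This is only an illustrative sketch; the constant names below are made up for the example and do not appear in the lexer.

```d
// Largest value each fixed-width hex escape can spell (illustrative constants).
enum uint maxX = 0xFF;          // \xHH   : two hex digits
enum uint maxU4 = 0xFFFF;       // \uHHHH : four hex digits
enum uint surrogateLo = 0xD800; // first UTF-16 surrogate
enum uint surrogateHi = 0xDFFF; // last UTF-16 surrogate

// \x can never reach the surrogate range, so the new hint cannot apply to it.
static assert(maxX < surrogateLo);
// \u (and the eight-digit \U) can, which is where the hint is useful.
static assert(maxU4 >= surrogateHi);

void main()
{
    // All three escape forms spell the same character here.
    assert("\x41" == "A");
    assert("\u0041" == "A");
    assert("\U00000041" == "A");

    // By contrast, a literal such as "\uD800" is expected to be rejected by
    // the lexer with the "invalid UTF character" error shown in the diff.
}
```

This matches the `ndigits != 2` guard in the diff, which already skips the validity check for two-digit escapes.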
I suspect |
This is an issue that Elias ran into, not realizing that the JSON file was UTF-16.
This change improves the error message so that it hints at why things went wrong for the next person.