Fix bugzilla issue 24841 - UTF-16 surrogates when used as an escape of a string should hint on error #17047

Merged
merged 1 commit into from
Nov 3, 2024

Conversation

rikkimax
Contributor

@rikkimax rikkimax commented Nov 2, 2024

Elias ran into this issue, not realizing that the JSON file was UTF-16.

This improves the error message so that it hints at what went wrong for the next person.

@dlang-bot
Contributor

Thanks for your pull request and interest in making D better, @rikkimax! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Auto-close Bugzilla Severity Description
24841 enhancement UTF-16 surrogates when used as an escape of a string should hint on error

Testing this PR locally

If you don't have a local development environment set up, you can use Digger to test this PR:

dub run digger -- build "stable + dmd#17047"

@rikkimax
Contributor Author

rikkimax commented Nov 2, 2024

This came up because JavaScript \u escapes are actually UTF-16 code units.

This affects the translation of JSON files over to D.

D's escapes use code points rather than code units, which is far saner.
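
To illustrate the difference, here is a minimal sketch of how a JSON/JavaScript surrogate pair maps to the single code point that a D escape expects. The helper `combineSurrogates` is a hypothetical name used only for illustration; it is not part of dmd or Phobos.

```d
/// Combine a UTF-16 surrogate pair into one code point.
/// Hypothetical helper for illustration; not part of dmd or Phobos.
dchar combineSurrogates(uint hi, uint lo) pure nothrow @nogc @safe
{
    assert(hi >= 0xD800 && hi <= 0xDBFF, "expected a high surrogate");
    assert(lo >= 0xDC00 && lo <= 0xDFFF, "expected a low surrogate");
    return cast(dchar)(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
}

void main()
{
    // JSON/JavaScript spells U+1F600 as the two code units "\uD83D\uDE00".
    dchar c = combineSurrogates(0xD83D, 0xDE00);

    // D spells the same character as one code point escape.
    assert(c == '\U0001F600');

    // The compiler produces the surrogates itself when a UTF-16 string is requested.
    assert("\U0001F600"w.length == 2);
    assert("\U0001F600"w[0] == 0xD83D);
    assert("\U0001F600"w[1] == 0xDE00);
}
```

So when translating JSON to D by hand, each `\uD8xx\uDCxx` style pair has to be folded into a single `\U` escape rather than copied across unchanged.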

@@ -1556,6 +1556,8 @@ class Lexer
         if (ndigits != 2 && !utf_isValidDchar(v))
         {
             error(loc, "invalid UTF character \\U%08x", v);
+            if (v >= 0xD800 && v <= 0xDFFF)
+                errorSupplemental("The code unit is a UTF-16 surrogate, is the escape UTF-16 not a Unicode code point?");
Contributor

Are we able to determine with certainty that it is not a code point?

Contributor Author

Yes, if this branch is taken.

UTF-16 surrogates are never valid code points. They are an encoding detail.

I wrote it as a question because, for all we know, they could be trying to use \x rather than trying to encode wchars.
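
As a rough check of that claim, the sketch below walks the whole surrogate range and confirms that none of it is accepted as a `dchar`. It assumes Phobos' `std.utf.isValidDchar` as a stand-in for the lexer's internal `utf_isValidDchar`; `isSurrogate` is a hypothetical helper that mirrors the range test in the diff.

```d
import std.utf : isValidDchar;

/// True if `v` lies in the UTF-16 surrogate range (same bounds as the diff).
/// Hypothetical helper for illustration.
bool isSurrogate(uint v) pure nothrow @nogc @safe
{
    return v >= 0xD800 && v <= 0xDFFF;
}

void main()
{
    // No value in the surrogate range is a valid dchar.
    foreach (uint v; 0xD800 .. 0xE000)
        assert(isSurrogate(v) && !isValidDchar(cast(dchar) v));

    // The neighbours on either side are valid and are not surrogates.
    assert(isValidDchar(cast(dchar) 0xD7FF) && !isSurrogate(0xD7FF));
    assert(isValidDchar(cast(dchar) 0xE000) && !isSurrogate(0xE000));

    // Values past U+10FFFF are invalid for a different reason,
    // so the hint should not fire for them.
    assert(!isValidDchar(cast(dchar) 0x110000) && !isSurrogate(0x110000));
}
```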

Contributor

Can we differentiate between \x and \U escapes?

Contributor Author

@rikkimax rikkimax Nov 3, 2024

Yes, but we don't need to.

\x takes only two hex digits, so it can't represent the surrogates, which need four.
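
The arithmetic behind this can be sketched as follows. `maxHexValue` is a hypothetical helper, and the last two assertions rely on my understanding that `\x` inserts a single raw byte while `\u` inserts the UTF-8 encoding of a code point.

```d
/// Largest value that can be written with `n` hex digits.
/// Hypothetical helper for illustration.
ulong maxHexValue(uint n) pure nothrow @nogc @safe
{
    return (1UL << (4 * n)) - 1;
}

void main()
{
    // \x: two digits top out at 0xFF, far below the first surrogate 0xD800.
    assert(maxHexValue(2) == 0xFF);
    assert(maxHexValue(2) < 0xD800);

    // \u: four digits cover the whole surrogate range 0xD800 to 0xDFFF.
    assert(maxHexValue(4) == 0xFFFF);
    assert(maxHexValue(4) >= 0xDFFF);

    // \x writes one raw byte; \u writes a UTF-8 encoded code point.
    assert("\xFF".length == 1);
    assert("\u00FF".length == 2);
}
```

This is why the `ndigits != 2` guard in the diff is enough: a value that reaches the surrogate test can only have come from a `\u` or `\U` escape.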

@rikkimax
Contributor Author

rikkimax commented Nov 3, 2024

I suspect Azure pipelines (Windows_DMD_latest x64) is failing due to low memory. It'll need a restart I think.

@thewilsonator thewilsonator merged commit b11b3f3 into dlang:stable Nov 3, 2024
71 of 73 checks passed
@rikkimax rikkimax deleted the fix-issue24841 branch November 7, 2024 10:43
3 participants