Text literals: Accept unpaired-surrogate escape codes. #9731

kazcw · 2024-04-17T13:44:47Z

Pull Request Description

Unpaired surrogates are not allowed by Unicode, but they occur in practice because many systems accept them; for example, they may be present in filenames on Windows (which are otherwise constrained to UTF-16).

Programs written in Enso should be able to work with them, if only because they represent edge cases that should be tested when converting encodings and at other system boundaries.

Before this change, escape codes in the range of surrogate codepoints were treated as uninterpretable escapes; they are now interpreted as their specified codepoint values.
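A minimal sketch of the behavior difference described above, not the actual lexer code. The function names `interpret_strict` and `interpret_lenient` are hypothetical, and the hex digits are assumed to have already been extracted from the escape. It relies only on the standard-library fact that `char::from_u32` returns `None` for surrogate values.

```rust
/// Hypothetical "strict" interpretation: the value must be a Unicode scalar
/// value, so anything in the surrogate range D800..=DFFF is rejected.
fn interpret_strict(hex: &str) -> Option<char> {
    u32::from_str_radix(hex, 16).ok().and_then(char::from_u32)
}

/// Hypothetical "lenient" interpretation: any codepoint up to 10FFFF is
/// accepted, including unpaired surrogates, and is kept as a plain `u32`.
fn interpret_lenient(hex: &str) -> Option<u32> {
    u32::from_str_radix(hex, 16).ok().filter(|cp| *cp <= 0x10FFFF)
}
```

Under these assumptions, `interpret_strict("D800")` yields `None` while `interpret_lenient("D800")` yields `Some(0xD800)`; both agree on ordinary values such as `"41"`.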

(Fixes an issue found while writing tests for #9456)

Important Notes

Generalize the representation of interpreted-text-escapes in the lexer, so that we are not tied to the strict Unicode of Rust's str.
Move some doc-comment code from the parser to test utilities.
Simplify token serialization.
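To illustrate why a representation tied to Rust's `str` cannot hold these values, here is a sketch of one possible codepoint type. The `Codepoint` newtype and its methods are hypothetical and are not taken from the Enso lexer; only `char::from_u32` and `char::encode_utf16` are standard-library calls.

```rust
/// Hypothetical codepoint wrapper: unlike `char`, it may hold a surrogate.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Codepoint(u32);

impl Codepoint {
    /// Accepts any value in the Unicode codespace (0..=10FFFF).
    pub fn new(value: u32) -> Option<Self> {
        (value <= 0x10FFFF).then_some(Codepoint(value))
    }

    /// True for values in the surrogate range D800..=DFFF.
    pub fn is_surrogate(self) -> bool {
        (0xD800..=0xDFFF).contains(&self.0)
    }

    /// Converts to `char` when the value is a Unicode scalar value.
    pub fn to_char(self) -> Option<char> {
        char::from_u32(self.0)
    }

    /// Appends the UTF-16 code units for this codepoint. A surrogate is
    /// emitted as a single, unpaired code unit.
    pub fn encode_utf16(self, out: &mut Vec<u16>) {
        match self.to_char() {
            Some(c) => {
                let mut buf = [0u16; 2];
                out.extend_from_slice(c.encode_utf16(&mut buf));
            }
            // Only surrogates reach this arm, and they always fit in 16 bits.
            None => out.push(self.0 as u16),
        }
    }
}
```

With this sketch, a lone `Codepoint` of `0xD800` encodes to the single code unit `0xD800`, which `String::from_utf16` rejects. That rejection is why the interpreted value has to live outside `str`.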

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

The documentation has been updated, if necessary.
Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
All code follows the Scala, Java, and Rust style guides. In case you are using a language not listed above, follow the Rust style guide.
All code has been tested:
- Unit tests have been written where possible.
- If GUI codebase was changed, the GUI was tested when built using ./run ide build.

JaroslavTulach

Using Escape 13 instead of Escape '\n' seems fine to me.

kazcw self-assigned this Apr 17, 2024

kazcw marked this pull request as ready for review April 17, 2024 13:45

kazcw requested review from mwu-tow, farmaazon, vitvakatu, Frizi and JaroslavTulach as code owners April 17, 2024 13:45

kazcw added the "CI: No changelog needed" label Apr 17, 2024

kazcw added 2 commits April 17, 2024 06:53

Fix

8b4afdc

Update tests

c1b38db

kazcw mentioned this pull request Apr 17, 2024

Copy/paste improvements #9734

Merged

5 tasks

JaroslavTulach requested a review from radeusgd April 18, 2024 04:52

JaroslavTulach approved these changes Apr 18, 2024

farmaazon approved these changes Apr 18, 2024

kazcw merged commit 0de490b into develop Apr 18, 2024
36 checks passed

kazcw deleted the wip/kw/escape-lexing branch April 18, 2024 13:21

enso-bot bot mentioned this pull request Apr 19, 2024

Generate completion of Table.join join criteria using data from both joined tables #5629

Closed
