Text literals: Accept unpaired-surrogate escape codes. #9731

Merged
merged 3 commits into from
Apr 18, 2024

@kazcw kazcw commented Apr 17, 2024

Pull Request Description

Unpaired surrogates are not allowed by Unicode, but they occur in practice because many systems accept them; for example, they may be present in filenames on Windows (which are otherwise constrained to UTF-16).

Programs written in Enso should be able to work with them, if only because they represent edge cases that should be tested when converting encodings and at other system boundaries.
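
To make the constraint concrete, here is a minimal sketch using only the Rust standard library; the helper names (`is_surrogate`, `decode_strict`, `decode_lossy`) are made up for illustration and are not part of this PR. It shows that UTF-16 input containing an unpaired surrogate cannot be converted to a `String` without either failing or losing information.

```rust
/// True if `code` lies in the UTF-16 surrogate range U+D800..=U+DFFF.
/// (Illustrative helper; not taken from the Enso code base.)
fn is_surrogate(code: u32) -> bool {
    (0xD800..=0xDFFF).contains(&code)
}

/// Strict decoding: returns `None` if the input contains an unpaired
/// surrogate, because `String` must hold well-formed UTF-8.
fn decode_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Lossy decoding: every unpaired surrogate is replaced by U+FFFD,
/// so the original code unit can no longer be recovered.
fn decode_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

For the same reason, `char::from_u32(0xD800)` returns `None`: a Rust `char` is a Unicode scalar value, and surrogates are excluded from that set.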

Before this change, escape codes in the range of surrogate codepoints were treated as uninterpretable escapes; they are now interpreted as their specified codepoint values.

(Fixes an issue found while writing tests for #9456)

Important Notes

  • Generalize the representation of interpreted-text-escapes in the lexer, so that we are not tied to the strict Unicode of Rust's str.
  • Move some doc-comment code from the parser to test utilities.
  • Simplify token serialization.
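
As a rough illustration of the first note, an interpreted escape can be stored as a raw codepoint value instead of a `char`. The sketch below is hypothetical: `EscapeValue` and `interpret_hex_escape` are invented names, and the actual lexer types in this PR may look quite different. It assumes the escape's hexadecimal digits have already been extracted from the literal.

```rust
/// Hypothetical representation of an interpreted escape: a raw codepoint
/// value rather than a `char`, so surrogates (U+D800..=U+DFFF) fit.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct EscapeValue(u32);

impl EscapeValue {
    /// The value as a Unicode scalar, or `None` if it is a surrogate.
    fn as_char(self) -> Option<char> {
        char::from_u32(self.0)
    }
}

/// Interpret the hexadecimal digits of an escape code.
/// Returns `None` only for malformed digits or values above U+10FFFF;
/// surrogate values are accepted.
fn interpret_hex_escape(digits: &str) -> Option<EscapeValue> {
    // Reject empty input, overlong input, and anything that is not a hex
    // digit (`from_str_radix` alone would also accept a leading `+`).
    if digits.is_empty() || digits.len() > 6 {
        return None;
    }
    if !digits.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let code = u32::from_str_radix(digits, 16).ok()?;
    (code <= 0x10FFFF).then_some(EscapeValue(code))
}
```

With this shape, `interpret_hex_escape("D800")` yields a value even though `as_char` on it returns `None`, so the decision about how to handle a surrogate is left to the consumer rather than made in the lexer.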

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the Scala, Java, and Rust style guides. In case you are using a language not listed above, follow the Rust style guide.
  • All code has been tested:
    • Unit tests have been written where possible.
    • If GUI codebase was changed, the GUI was tested when built using ./run ide build.

@kazcw kazcw self-assigned this Apr 17, 2024
@kazcw kazcw marked this pull request as ready for review April 17, 2024 13:45
@kazcw kazcw added the CI: No changelog needed Do not require a changelog entry for this PR. label Apr 17, 2024
@kazcw kazcw mentioned this pull request Apr 17, 2024
@JaroslavTulach JaroslavTulach left a comment

Using Escape 13 instead of Escape '\n' seems fine to me.

@kazcw kazcw merged commit 0de490b into develop Apr 18, 2024
36 checks passed
@kazcw kazcw deleted the wip/kw/escape-lexing branch April 18, 2024 13:21