Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escapes are broken #521

Open
jiribenes opened this issue Jul 20, 2024 · 3 comments
Open

Escapes are broken #521

jiribenes opened this issue Jul 20, 2024 · 3 comments
Labels
bug Something isn't working epic Super-issue for a whole class of features quality-of-life requires-design

Comments

@jiribenes
Copy link
Contributor

jiribenes commented Jul 20, 2024

Currently, unicode escapes themselves are valid characters which is somewhat confusing, I'd expect the following escape to have to be written as '\u0a00' instead of \u0a00:
image

I'd propose only allowing unicode escapes inside of string or char literals.

Here's the relevant code:

case '\\' if nextMatches("u") => matchUnicodeLiteral()

/** Matches a character: '.+' */
def matchChar(): TokenKind = {
if (peekMatches(_ == '\'')) err("Empty character literal.")
consumeWhile(_ != '\'')
expect('\'')
// exclude opening and closing '
val cs = slice(start + 1, current - 1)
TokenKind.Chr(cs.codePointAt(0))
}
lazy val hexadecimal = ('a' to 'f') ++ ('A' to 'F') ++ ('0' to '9')
/** \u<HEXADECIMAL>+ */
def matchUnicodeLiteral(): TokenKind = {
consumeWhile(hexadecimal.contains(_))
val n = slice(start + 2, current)
try {
TokenKind.Chr(Integer.parseInt(n, 16))
} catch {
case e: Throwable => err("Invalid unicode literal.")
}
}


Related to this: we allow wildly nonstandard escapes in string literals ("anything goes"):

case Some('\\') => {
expect('\\')
peek() match {
case Some('\"') | Some('\\') | Some('n') | Some('t') | Some('r') => consume()
case Some('u') => {
consume()
consumeWhile(c => !c.isWhitespace && c != '"')

This requires some design, but I'd prefer to agree on a common format of an escape, for example Zig has:
\u{NNNNNN} as "hexadecimal Unicode scalar value UTF-8 encoded (1 or more digits)"
I like it because it allows multiple digits (subsuming the need for separate \xNN and \uNNNN and \uNNNNNN and whatever else) and it is easy to parse (<escape> ::= '\u{' <hex>+ '}')

I think it's also fine to invest a little bit into lexing escapes properly by ourselves :)

@jiribenes
Copy link
Contributor Author

jiribenes commented Sep 20, 2024

Related:

effekt> '\n' == '\t'
true

cc @marzipankaiser

Let's just please fix the lexer for strings and characters incl. escapes. 🙏

@jiribenes jiribenes changed the title Unicode escapes outside literals Escapes are broken Sep 30, 2024
@jiribenes jiribenes added the bug Something isn't working label Sep 30, 2024
@jiribenes
Copy link
Contributor Author

Another issue: val c = '\' could have two possible meanings:

  1. a character that is not completely parsed yet that contains an escaped ' inside
  2. a character that completely parsed and contains just \ inside

Currently, we choose 2, but it’s inconsistent with what I presumed when making the highlighting (1 — what Scala does), and with common expectations of programmers & LLMs.
We agreed on Slack to ban this as an unclosed character and require writing val backslash = '\\' for a backslash and val apostrophe = '\'' for an apostrophe.

@jiribenes jiribenes added the epic Super-issue for a whole class of features label Sep 30, 2024
@marzipankaiser
Copy link
Contributor

Another issue: val c = '\' could have two possible meanings:

  1. a character that is not completely parsed yet that contains an escaped ' inside
  2. a character that completely parsed and contains just \ inside

Currently, we choose 2, but it’s inconsistent with what I presumed when making the highlighting (1 — what Scala does), and with common expectations of programmers & LLMs. We agreed on Slack to ban this as an unclosed character and require writing val backslash = '\\' for a backslash and val apostrophe = '\'' for an apostrophe.

In #622, I changed it to 1.

marzipankaiser added a commit that referenced this issue Oct 2, 2024
This gets us half-way there for escapes in Strings:
- Allow escapes in character literals, i.e. `'\n'` is a newline
character
    - Note: This disallows `'\'` for \, but requires `'\\'`
    - Fixes `'\n' == '\t'`  (cmp. #521)
- Check that character literals consist of only one codepoint
- Report wrong escape sequences in parser, but generate the error in
lexer - this only happens for SOME of those
- We should generate error tokens everywhere instead, different issue
- Escape ourselves in Lexer and build up a string
- Add additional escapes: `\$` (because of unquotes), `\u{...}`
- In chez, escape all characters in one function, use chez's unicode
handling functions for characters beyond 3 octals
   - Fixes #596
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working epic Super-issue for a whole class of features quality-of-life requires-design
Projects
None yet
Development

No branches or pull requests

2 participants