Lexer: match escape sequences in strings #290

dvdvgt · 2023-10-13T09:54:26Z

This PR addresses and fixes issue #202.

Of course, I am open to suggestions on how to properly integrate escape sequences into the lexing process as proposed by @b-studios (#202 (comment)):

> [...] we should actually design strings and escape sequences at some point.

What changes do you think this entails?

b-studios · 2023-10-13T10:59:57Z

> What changes do you think this entails?

Most languages say in their spec what a String is and what an escape sequence is (for instance here for JS: https://tc39.es/ecma262/multipage/text-processing.html#prod-CharacterEscape).

So far we have always been "use case driven". That is, we implement what we need for a specific use case rather than designing and implementing a standard.

dvdvgt · 2023-10-15T13:56:18Z

So we would need to decide on a grammar for strings and change the lexer accordingly? For example

<string> ::= " <string2>* "
<string2> ::= <escaped> | <char>
<escaped> ::= \" | \n | \t | \r | \v | \f

Currently, a string literal is tokenized as a whole. If we want better errors for invalid control characters or escape sequences, each of these will probably need to be its own token. However, the lexer currently skips all whitespace, which makes it impossible to correctly tokenize a string containing whitespace ("a b" would be tokenized as ", a, b). Therefore, we either need to accept the updated regex for string literals or change lexing so that whitespace is not ignored. Or am I perhaps missing something?
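To make the first option concrete, here is a minimal sketch of tokenizing a string literal as a whole with a single regex that also accepts the escapes from the grammar above. It is written in Python purely for illustration; the pattern and the helper name `lex_string` are hypothetical and are not the regex changed in this PR.

```python
import re

# Hypothetical pattern, NOT the regex from this PR.
# Body of the literal: either one of the escapes \" \n \t \r \v \f,
# or any character other than a quote, a backslash, or a line break.
STRING_LITERAL = re.compile(r'"(?:\\["ntrvf]|[^"\\\n])*"')


def lex_string(source, pos=0):
    """Return (lexeme, next_pos) if a string literal starts at pos, else None."""
    m = STRING_LITERAL.match(source, pos)
    return (m.group(0), m.end()) if m else None
```

Because the whole literal is one token, the whitespace inside "a b" never reaches the whitespace-skipping logic. The downside is the one mentioned above: an invalid escape such as `\q` only makes the whole match fail, so the lexer cannot point at the offending sequence. The sketch follows the grammar literally, so it has no escape for the backslash itself.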

What are your thoughts on this?

b-studios · 2023-10-16T14:40:01Z

I know too little about string escapes, their use cases, and how they would be transported in the compiler to the different backends, so I don't have an informed opinion. The grammar you suggest makes some sense, but I don't know what the v and f options are supposed to be.

dvdvgt · 2023-10-16T16:28:05Z

> I know too little about string escapes, their use cases, and how they would be transported in the compiler to the different backends, so I don't have an informed opinion. The grammar you suggest makes some sense, but I don't know what the v and f options are supposed to be.

They are just taken from the JS spec you posted: \v is a vertical tab and \f is a form feed. You can read more about them here.

jiribenes · 2023-10-17T13:17:42Z

At the meeting, we talked about possible designs for escape sequences in strings and settled on a design similar to Zig's:
https://ziglang.org/documentation/master/#Escape-Sequences
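
For readers who do not want to follow the link, the following is a rough sketch of decoding Zig-style escapes (`\n`, `\r`, `\t`, `\\`, `\'`, `\"`, `\xNN`, `\u{NNNNNN}`). It is an illustration in Python, not the Effekt implementation; the function name `unescape` is hypothetical, and as a simplification `\xNN` is mapped to the code point NN rather than to a raw byte.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")
SIMPLE = {"n": "\n", "r": "\r", "t": "\t", "\\": "\\", "'": "'", '"': '"'}


def unescape(body):
    """Decode Zig-style escapes in the body of a string literal (no quotes)."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash at end of string")
        kind = body[i + 1]
        if kind in SIMPLE:
            out.append(SIMPLE[kind])
            i += 2
        elif kind == "x":
            # Exactly two hex digits; simplification: treated as a code point.
            digits = body[i + 2:i + 4]
            if len(digits) != 2 or not set(digits) <= HEX_DIGITS:
                raise ValueError(f"invalid \\x escape at index {i}")
            out.append(chr(int(digits, 16)))
            i += 4
        elif kind == "u":
            # One to six hex digits between braces.
            end = body.find("}", i + 3)
            digits = body[i + 3:end] if end != -1 else ""
            if body[i + 2:i + 3] != "{" or not 1 <= len(digits) <= 6 \
                    or not set(digits) <= HEX_DIGITS:
                raise ValueError(f"invalid \\u escape at index {i}")
            # chr raises ValueError for values above 0x10FFFF.
            out.append(chr(int(digits, 16)))
            i = end + 1
        else:
            raise ValueError(f"unknown escape \\{kind} at index {i}")
    return "".join(out)
```

Decoding by hand like this makes it possible to report the index of an invalid escape, which a single whole-literal regex cannot do.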

b-studios · 2023-10-17T13:19:22Z

@jiribenes's proposal sounds good. However, we need to check that the encoding in source, Scala, and the backends somehow aligns (we need enough tests! :) )

Proposal approved by the Effekt committee (in person meeting).

dvdvgt · 2023-10-20T17:53:25Z

What do you mean by align?

The given grammar is not compatible across all backends. For example, "\u039e" is displayed as Ξ in node, yet is rejected by the MLton compiler: String constant with character too large for type: #"\u039E".

Either we need a backend-specific check that always converts strings into valid strings for the respective backend, or we need to find the smallest subset of the given grammar that is supported by all backends.
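
A backend-specific check of the first kind could look roughly like the sketch below. It is in Python for illustration only; the backend names, the table `MAX_CODE_POINT`, and the limits in it are assumptions made for this example (in particular, the 8-bit limit for the ML backend is inferred from the MLton error above, not taken from its documentation).

```python
# Hypothetical table: largest code point each backend is ASSUMED to accept
# in a string constant. These values are illustrative, not verified limits.
MAX_CODE_POINT = {
    "js": 0x10FFFF,
    "ml": 0xFF,
}


def rejected_by(text):
    """Return the sorted names of backends that would reject this decoded string."""
    return sorted(
        backend
        for backend, limit in MAX_CODE_POINT.items()
        if any(ord(c) > limit for c in text)
    )
```

With such a table, the second option (the smallest common subset) amounts to accepting only strings for which `rejected_by` returns an empty list.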

Or am I perhaps misinterpreting what you are suggesting?

b-studios · 2023-10-26T09:14:35Z

No, I just meant that ideally strings would work the same on all backends. But it is fine for now (TM)

b-studios · 2023-11-07T17:07:02Z

Eventually we can write a custom parser like ; to actually lex it.

change redex such that it also matches escape sequences

21912a8

dvdvgt linked an issue Oct 23, 2023 that may be closed by this pull request

Lexer doesn't recognise escaped double quotes in strings #202

Closed

b-studios approved these changes Nov 7, 2023

b-studios merged commit cf3b7d1 into master Nov 7, 2023
2 checks passed

b-studios deleted the fix/escape-quotes branch November 7, 2023 17:06

dvdvgt mentioned this pull request Nov 8, 2023

Proper String lexing #306

Closed
