Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lexer: match escape sequences in strings #290

Merged
merged 1 commit into from
Nov 7, 2023
Merged

Conversation

dvdvgt
Copy link
Collaborator

@dvdvgt dvdvgt commented Oct 13, 2023

This PR addresses and fixes issue #202.

Of course, I am open to suggestions on how to properly integrate escape sequences into the lexing process as proposed by @b-studios (#202 (comment)):

[...] we should actually design strings and escape sequences at some point.

What changes do you think this entails?

@b-studios
Copy link
Collaborator

b-studios commented Oct 13, 2023

What changes do you think this entails?

Most languages say in their spec what a String is and what an escape sequence is (for instance here for JS: https://tc39.es/ecma262/multipage/text-processing.html#prod-CharacterEscape).

So far we are always "use case driven". That is, we implement what we need to implement a specific use case; not design and implement a standard.

@dvdvgt
Copy link
Collaborator Author

dvdvgt commented Oct 15, 2023

So we would need to decide on a grammar for strings and change the lexer accordingly? For example

<string> ::= " <string2>* "
<string2> ::= <escaped> | <char>
<escaped> ::= \" | \n | \t | \r | \v | \f

Currently, a string literal is tokenized as a whole. If we want to have better errors regarding invalid control characters/escape sequences, each of these will be probably need to be its own token. However, the lexer is currently skipping all whitespaces, making it impossible to correctly tokenize a string containing such whitespaces ("a b" would be tokenized as ", a, b). Therefore, we either need to accept the updated regex for string literals or change lexing such that whitespaces are not ignored. Or am I perhaps missing something?

What are your thoughts on this?

@b-studios
Copy link
Collaborator

I know too little about string escapes, their use cases, and how they would be transported in the compiler to the different backends; so I don't have an informed opinion. The grammar you suggest makes somewhat sense but I don;t know what the v and f options are supposed to be.

@dvdvgt
Copy link
Collaborator Author

dvdvgt commented Oct 16, 2023

I know too little about string escapes, their use cases, and how they would be transported in the compiler to the different backends; so I don't have an informed opinion. The grammar you suggest makes somewhat sense but I don;t know what the v and f options are supposed to be.

They are just taken from the JS spec sheet you posted. You can read more about them here.

@jiribenes
Copy link
Contributor

On the meeting, we talked about possible designs of escape sequences in strings and settled on similar design as Zig:
https://ziglang.org/documentation/master/#Escape-Sequences

image

@b-studios
Copy link
Collaborator

b-studios commented Oct 17, 2023

@jiribenes's proposal sounds good. However, we need to check that the encoding in source, Scala, and the backends somehow aligns (we need enough tests! :) )

Proposal approved by the Effekt committee (in person meeting).

@dvdvgt
Copy link
Collaborator Author

dvdvgt commented Oct 20, 2023

What do you mean by align?

The given grammar is not compatible across all backends. For example, "\u039e" is displayed as Ξ in node, yet is rejected by the MLton compiler: String constant with character too large for type: #"\u039E".

Either we somehow need to deploy some backend specific check for strings that always converts them into valid strings for the respective backend, or find the smallest subset of the given grammar supported by all backends.

Or am I perhaps misinterpreting what you are suggesting?

@dvdvgt dvdvgt linked an issue Oct 23, 2023 that may be closed by this pull request
@b-studios
Copy link
Collaborator

No, I just meant that ideally strings would potentially work the same on all backends. But it is fine for now (TM)

@b-studios b-studios merged commit cf3b7d1 into master Nov 7, 2023
2 checks passed
@b-studios b-studios deleted the fix/escape-quotes branch November 7, 2023 17:06
@b-studios
Copy link
Collaborator

Eventually we can write a custom parser like ; to actually lex it.

@dvdvgt dvdvgt mentioned this pull request Nov 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Lexer doesn't recognise escaped double quotes in strings
3 participants