Add ES2023 tokenization #9

Open
arhadthedev opened this issue Mar 24, 2024 · 0 comments
Labels
feature Addition of something new part: parser File-to-tree conversion spec: es2023 Conformance to https://262.ecma-international.org/14.0/

arhadthedev commented Mar 24, 2024

For starters, it can be a Tokenizer struct right inside the crate.

The struct should recognize the lexical grammar described in https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar.

The struct should provide methods to switch between goal symbols on the fly. The spec section linked above gives the following reasoning for such a feature:

There are several situations where the identification of lexical input elements is sensitive to the syntactic grammar context that is consuming the input elements. This requires multiple goal symbols for the lexical grammar. The InputElementRegExpOrTemplateTail goal is used in syntactic grammar contexts where a RegularExpressionLiteral, a TemplateMiddle, or a TemplateTail is permitted.

A syntactic grammar context is defined by the parser state, so it is the parser that needs to switch the current lexical grammar goal symbol to adjust tokenization.

The user entry point is the Tokenizer object with its get_next_symbol method:

/// From <https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar>:
///
/// > There are several situations where the identification of lexical input
/// > elements is sensitive to the syntactic grammar context that is consuming
/// > the input elements. This requires multiple goal symbols for the lexical
/// > grammar.
#[derive(Debug)] // required because `Tokenizer` derives `Debug` and stores a `GoalSymbols`
pub enum GoalSymbols {
    /// From <https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar>:
    ///
    /// > Used at the start of a Script or Module
    InputElementHashbangOrRegExp,

    /// From <https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar>:
    ///
    /// > Used in syntactic grammar contexts where a `RegularExpressionLiteral`,
    /// > a `TemplateMiddle`, or a `TemplateTail` is permitted.
    InputElementRegExpOrTemplateTail,

    /// From <https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar>:
    ///
    /// > Used in all syntactic grammar contexts where a
    /// > `RegularExpressionLiteral` is permitted but neither a
    /// > `TemplateMiddle`, nor a `TemplateTail` is permitted.
    InputElementRegExp,

    /// From <https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar>:
    ///
    /// > Used in all syntactic grammar contexts where a `TemplateMiddle` or a
    /// > `TemplateTail` is permitted but a `RegularExpressionLiteral` is
    /// > not permitted
    InputElementTemplateTail,

    /// From <https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar>:
    ///
    /// > In all other contexts, [...] used as the lexical goal symbol.
    InputElementDiv,
}

#[derive(Debug)]
pub struct Tokenizer {
    /// The lexical grammar goal symbol currently in effect
    current_goal: GoalSymbols,
}
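
To illustrate why the parser has to switch goals, here is a minimal sketch, not the crate's actual implementation. It assumes a hypothetical `SketchTokenizer` with a `set_goal` method, a reduced two-variant `Goal` enum, and a deliberately naive scanner (no regular expression flags, classes, or escapes). The same `/` character is read as a division punctuator under `InputElementDiv` and as the start of a regular expression literal under `InputElementRegExp`.

```rust
/// Hypothetical, reduced set of goal symbols (two of the five listed above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Goal {
    InputElementDiv,
    InputElementRegExp,
}

/// Hypothetical token type, far coarser than the real lexical grammar.
#[derive(Debug, PartialEq, Eq)]
pub enum Token<'a> {
    Punctuator(&'a str),
    RegExp(&'a str),
    Other(&'a str),
}

pub struct SketchTokenizer<'a> {
    rest: &'a str,
    goal: Goal,
}

impl<'a> SketchTokenizer<'a> {
    pub fn new(source: &'a str) -> Self {
        Self { rest: source, goal: Goal::InputElementDiv }
    }

    /// Called by the parser whenever its state changes what may come next.
    pub fn set_goal(&mut self, goal: Goal) {
        self.goal = goal;
    }

    pub fn get_next_symbol(&mut self) -> Option<Token<'a>> {
        self.rest = self.rest.trim_start();
        if self.rest.is_empty() {
            return None;
        }
        let (len, is_slash) = if self.rest.starts_with('/') {
            let len = match self.goal {
                // Under this goal a slash is just a one-character punctuator.
                Goal::InputElementDiv => 1,
                // Naive: consume up to and including the next slash,
                // or the whole remainder if there is none.
                Goal::InputElementRegExp => self.rest[1..]
                    .find('/')
                    .map_or(self.rest.len(), |i| i + 2),
            };
            (len, true)
        } else {
            // Anything else runs until the next slash or whitespace.
            let len = self
                .rest
                .find(|c: char| c == '/' || c.is_whitespace())
                .unwrap_or(self.rest.len());
            (len, false)
        };
        let (text, tail) = self.rest.split_at(len);
        self.rest = tail;
        Some(match (is_slash, self.goal) {
            (true, Goal::InputElementDiv) => Token::Punctuator(text),
            (true, Goal::InputElementRegExp) => Token::RegExp(text),
            (false, _) => Token::Other(text),
        })
    }
}
```

In this sketch the goal is mutable state on the tokenizer. Passing the goal as an argument to `get_next_symbol` would be an equally plausible design.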

Grammar parameter implementation

For notation like Production[Param1, Param2], add the following comment to the code:

Notes on how specification features are mapped into YACC features:

  • grammatical parameters (like Nonterminal[Param1, Param2]) are implemented as special FecerBrowser_...KeywordUsage static semantics, each returning a boolean, one for each pair of a left-side parameter and its right-side usage
  • lookahead restriction
  • [No Line Terminator Here]
  • automatic semicolon insertion
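
The first bullet can be pictured in Rust rather than YACC. The sketch below is an illustration under assumptions, not the mapping the comment describes: `Params` and `guarded_alternative_applies` are hypothetical names. It models the spec production `IdentifierReference[Yield, Await]`, whose alternatives `[~Yield] yield` and `[~Await] await` apply only when the corresponding parameter is absent, as a function returning a boolean for each parameter and usage pair.

```rust
/// Hypothetical carrier for the grammar parameters of a production.
#[derive(Debug, Clone, Copy)]
pub struct Params {
    pub yield_param: bool,
    pub await_param: bool,
}

/// Returns `Some(true)` when a parameter-guarded alternative applies,
/// `Some(false)` when the guard rejects it, and `None` for words that are
/// not covered by a `[~Yield]` or `[~Await]` guard in this sketch.
pub fn guarded_alternative_applies(word: &str, params: Params) -> Option<bool> {
    match word {
        // `[~Yield] yield`: allowed only when the Yield parameter is absent.
        "yield" => Some(!params.yield_param),
        // `[~Await] await`: allowed only when the Await parameter is absent.
        "await" => Some(!params.await_param),
        _ => None,
    }
}
```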

Cover grammar

Also add a note on what a cover grammar is.

Todo list:

@arhadthedev arhadthedev added feature Addition of something new spec: es2023 Conformance to https://262.ecma-international.org/14.0/ part: parser File-to-tree conversion labels Mar 24, 2024
@arhadthedev arhadthedev changed the title Add tokenization Add ES2023 tokenization Mar 24, 2024
arhadthedev added a commit that referenced this issue Mar 26, 2024
We start with the way to report syntax errors.
@arhadthedev arhadthedev self-assigned this Mar 27, 2024
arhadthedev added a commit that referenced this issue Apr 7, 2024
Now we have a global sample table of the tokens that the whole lexer
(tokenizer) can process, and all tests refer to it as their parameter type
for substitution.
arhadthedev added a commit that referenced this issue Apr 11, 2024
Replace `|foo| foo.1` with `|(_, bar)| bar`. Numeric fields no longer
hide their meaning; we name these fields at the point of use.
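
As a small illustration of that refactoring (the sample-table shape and the function names here are hypothetical, not taken from the repository), both closures below extract the second tuple field, but the destructuring form names it:

```rust
/// Hypothetical sample row: (source text, expected token name).
type Sample = (&'static str, &'static str);

/// Before: the meaning of `.1` is opaque at the point of use.
pub fn names_by_index(samples: Vec<Sample>) -> Vec<&'static str> {
    samples.into_iter().map(|sample| sample.1).collect()
}

/// After: the closure pattern ignores the first field and names the second.
pub fn names_by_pattern(samples: Vec<Sample>) -> Vec<&'static str> {
    samples.into_iter().map(|(_, name)| name).collect()
}
```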
arhadthedev added a commit that referenced this issue Jun 9, 2024
`get_next_token` tries to get the unparsed tail by chopping off a
recognized token no matter whether we got the token or a parse error. As
a result, an error leads to a panic.

This PR moves the tail extraction into the handling of a successful parse.
It also adds a regression test that uses the `claims` crate.
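
A minimal sketch of the fixed control flow, assuming a hypothetical `recognize` helper that accepts only a run of ASCII digits (the real recognizer and error type are not shown in this issue). The tail is sliced only inside the success path, so an error is returned as a value instead of triggering a slice on a token that does not exist:

```rust
/// Hypothetical error type standing in for the crate's real parse error.
#[derive(Debug, PartialEq, Eq)]
pub struct ParseError;

/// Hypothetical recognizer: accepts a leading run of ASCII digits.
fn recognize(input: &str) -> Result<&str, ParseError> {
    let len = input.bytes().take_while(u8::is_ascii_digit).count();
    if len == 0 {
        Err(ParseError)
    } else {
        Ok(&input[..len])
    }
}

/// Returns the recognized token together with the unparsed tail.
pub fn get_next_token(input: &str) -> Result<(&str, &str), ParseError> {
    // The tail is computed only when a token was actually recognized.
    recognize(input).map(|token| (token, &input[token.len()..]))
}
```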
arhadthedev added a commit that referenced this issue Jun 18, 2024
Prepare the parser for exposure of `InputElementRegExp`,
`InputElementRegExpOrTemplateTail`, and `InputElementHashbangOrRegExp`
extra goal symbols by implementing the unified logic of processing them
all.

Since the goal symbols share most of the non-terminals on their right
side, we can create an enum listing all possible right-side
non-terminals. This lets us write per-goal logic that copies into the
unified enum and slightly modify the existing `extract_token` to make it
a shared processor of the unified enum.
 
For more details and the ultimate purpose, see description of
`GoalSymbols` in the parent issue.
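
The unified-enum idea can be sketched as follows. The names `Goal`, `Unified`, `goal_specific`, and `accepts` are hypothetical and do not reflect the crate's real `extract_token`. The right sides follow the lexical grammar of the linked specification, where every goal shares `WhiteSpace`, `LineTerminator`, `Comment`, and `CommonToken` and differs only in two further alternatives.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Goal {
    InputElementDiv,
    InputElementRegExp,
    InputElementRegExpOrTemplateTail,
    InputElementTemplateTail,
    InputElementHashbangOrRegExp,
}

/// Every non-terminal that appears on the right side of some goal symbol.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Unified {
    WhiteSpace,
    LineTerminator,
    Comment,
    CommonToken,
    DivPunctuator,
    RightBracePunctuator,
    RegularExpressionLiteral,
    TemplateSubstitutionTail,
    HashbangComment,
}

/// Per-goal part: the alternatives that are not shared by all goals.
pub fn goal_specific(goal: Goal) -> &'static [Unified] {
    use Unified::*;
    match goal {
        Goal::InputElementDiv => &[DivPunctuator, RightBracePunctuator],
        Goal::InputElementRegExp => &[RightBracePunctuator, RegularExpressionLiteral],
        Goal::InputElementRegExpOrTemplateTail => {
            &[RegularExpressionLiteral, TemplateSubstitutionTail]
        }
        Goal::InputElementTemplateTail => &[DivPunctuator, TemplateSubstitutionTail],
        Goal::InputElementHashbangOrRegExp => &[HashbangComment, RegularExpressionLiteral],
    }
}

/// Shared part: one check that works for every goal.
pub fn accepts(goal: Goal, element: Unified) -> bool {
    use Unified::*;
    matches!(element, WhiteSpace | LineTerminator | Comment | CommonToken)
        || goal_specific(goal).contains(&element)
}
```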
arhadthedev added a commit that referenced this issue Jun 21, 2024
According to the ECMAScript specification, ReservedWord always follows
IdentifierName (included in CommonToken) in grammar definition
(<https://262.ecma-international.org/14.0/#sec-names-and-keywords>):

> The syntactic grammar defines Identifier as an IdentifierName that
> is not a ReservedWord.

So we make every goal symbol that supports CommonToken support
ReservedWord as well.
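
One way to picture the relationship, as a sketch under assumptions: the tokenizer first matches an IdentifierName and then checks it against the reserved words. `Word`, `classify`, and the abbreviated `RESERVED_WORDS` table below are hypothetical; the table is only a subset of the specification's list.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Word<'a> {
    ReservedWord(&'a str),
    IdentifierName(&'a str),
}

/// Abbreviated subset of the reserved words, for illustration only.
const RESERVED_WORDS: &[&str] = &["break", "class", "const", "for", "if", "return", "while"];

/// Classifies text that has already been matched as an IdentifierName.
pub fn classify(name: &str) -> Word<'_> {
    if RESERVED_WORDS.contains(&name) {
        Word::ReservedWord(name)
    } else {
        Word::IdentifierName(name)
    }
}
```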
arhadthedev added a commit that referenced this issue Jun 21, 2024
arhadthedev added a commit that referenced this issue Jun 21, 2024
arhadthedev added a commit that referenced this issue Jun 21, 2024
From <https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar>:

> There are several situations where the identification of lexical input
> elements is sensitive to the syntactic grammar context that is
> consuming the input elements. This requires multiple goal symbols
> for the lexical grammar.
arhadthedev added a commit that referenced this issue Jun 22, 2024
Some rules match for certain goal symbols and produce an error for
others. Check this.
arhadthedev added a commit that referenced this issue Jun 22, 2024
Some rules match for certain goal symbols and produce an error for
others. Now we check this.
arhadthedev added a commit that referenced this issue Jun 26, 2024
For some reason, such a format is easier to read.
arhadthedev added a commit that referenced this issue Jun 26, 2024
It allows us to get rid of a mut-argument after caller-side cloning.